Predictive analytics: Obliterated

Thoughts

Questions

  • What deficit are we talking about? Is it covered by the mRS?
  • Should the pre-treatment mRS be in the model? Already covered by the others?

Backlog

  • Honest estimation
  • G estimation (using only significant ones to get accurate ORs)
  • Make sure that the number of outcomes (26) is valid.
  • Bootstrap the whole process - how many models include each of the features?
  • The SuSiE method: https://www.biorxiv.org/content/10.1101/501114v1
  • Choose the best model with stepwise (forward/backward) selection
  • Feature selection with honest estimation + p-value-based + bootstrap
  • Feature selection with honest estimation + LASSO-based + bootstrap
  • Implement PCA to create independent variables and not drop variables
  • Bayesian fitting
  • Change to use tidybayes and tidymodels
  • Use neural networks to fit
  • Consider a time-to-event analysis
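One way to start on the bootstrap backlog item: refit a penalized model on bootstrap resamples and count how often each feature survives selection. A minimal sketch on simulated data (all variable names here are illustrative, not from the AVM dataset):

```r
# Bootstrap stability selection sketch: fit a cross-validated LASSO
# logistic model on each resample and record which coefficients are
# non-zero at lambda.1se.
library(glmnet)

set.seed(1891)
n <- 200
df <- data.frame(
  y  = rbinom(n, 1, 0.5),
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = rnorm(n)
)

n_boot <- 100
selected <- matrix(
  0, nrow = n_boot, ncol = 3,
  dimnames = list(NULL, c("x1", "x2", "x3"))
)

for (b in seq_len(n_boot)) {
  idx <- sample(nrow(df), replace = TRUE)
  x <- as.matrix(df[idx, c("x1", "x2", "x3")])
  fit <- cv.glmnet(x, df$y[idx], family = "binomial")
  coefs <- coef(fit, s = "lambda.1se")[-1, 1] # drop the intercept
  selected[b, ] <- as.numeric(coefs != 0)
}

# Proportion of bootstrap models that include each feature
colMeans(selected)
```

The same resampling loop could wrap the honest-estimation or p-value-based selection steps instead of the LASSO.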

Setup

Imports

# Load packages
library(glmnet)
library(magrittr)
library(selectiveInference)
library(tidyverse)

# Data
# https://office365stanford-my.sharepoint.com/:x:/r/personal/simonlev_stanford_edu/_layouts/15/Doc.aspx?sourcedoc=%7BCE7DAF19-8F1A-43B0-866D-3D0F36336EDC%7D&file=Peds%20AVM%20Database%20July%2010%202024%20Onwards.xlsx&fromShare=true&action=default&mobileredirect=true

# Source code

Configurations

# Source outcome-specific configs
source(params$CONFIG)

# Destination locations
DST_DIRNAME <- "../../outputs/"

# Source locations
SRC_DIRNAME <- "../../data/3_tidy/patients"
SRC_BASENAME <- "patients-daily_2024-07-21.csv"

# Should the Dockerfile be automatically updated?
UPDATE_DOCKERFILE <- FALSE

# Set the seed
set.seed(1891)

Parameters

EXPOSURES_CONTINUOUS <- c(
  "Age at 1st treatment (years)" = "age_at_first_treatment_yrs",
  "mRS (presentation)" = "modified_rankin_score_presentation",
  "mRS (pre-treatment)" = "modified_rankin_score_pretreatment",
  "mRS (1-week post-op)" = "modified_rankin_score_postop_within_1_week",
  "mRS (final)" = "modified_rankin_score_final",
  "Nidus size (cm)" = "max_size_cm",
  "Spetzler-Martin grade" = "spetzler_martin_grade",
  "Lawton-Young grade" = "lawton_young_grade",
  "Size score" = "size_score"
)

EXPOSURES_BINARY <- c(
  "Sex (Male)" = "is_male",
  "Involves eloquent location" = "is_eloquent_location",
  "Has deep venous drainage" = "has_deep_venous_drainage",
  "Diffuse nidus" = "is_diffuse_nidus",
  "Hemorrhage at presentation" = "has_hemorrhage",
  "Seizures at presentation" = "has_seizures",
  "Deficit at presentation" = "has_deficit",
  "Paresis at presentation" = "has_paresis",
  "Presence of aneurysms" = "has_associated_aneurysm",
  "Spetzler-Martin grade < 4" = "is_spetzler_martin_grade_less_than_4",
  "Had surgery" = "is_surgery"
)

EXPOSURES_CATEGORICAL <- c(
  "Location" = "location",
  "Venous drainage" = "venous_drainage",
  "Modality of treatment" = "procedure_combinations"
)

OUTCOMES <- c(
  "Poor outcome (mRS >= 3)" = "is_poor_outcome",
  "Obliteration" = "is_obliterated",
  "Complications - minor" = "has_complication_minor",
  "Complications - major" = "has_complication_major",
  "mRS change (final - pre-treatment)" =
    "modified_rankin_score_final_minus_pretx",
  "mRS change (final - presentation)" =
    "modified_rankin_score_final_minus_presentation",
  "mRS change direction (Final - Pre-tx)" = "change_in_mrs_tx_vs_final"
)

Outputs

Create lists.

plots <- list()
tables <- list()
models <- list()

Functions

R utils.

Data analysis utils.


Read

# File path
filepath <- file.path(SRC_DIRNAME, SRC_BASENAME)

# Read
patients_ <-
  read_csv(filepath) %>%
  dplyr::sample_frac(params$PROPORTION_OF_DATA)

# If analyzing a subset of the data, restrict the dataset
if (params$TITLE == "Poor outcome - Hemorrhage") {
  patients_ <- patients_ %>% filter(has_hemorrhage)
}
if (params$TITLE == "Poor outcome - No Hemorrhage") {
  patients_ <- patients_ %>% filter(!has_hemorrhage)
}

Conform

Setup

Remove all empty rows.

patients_ <-
  patients_ %>%
  filter(!is.na(patient_id)) %>%
  arrange(patient_id)

Create two dataframes - one for univariable and one for multivariable analysis. This is because variables for multivariable analysis will be recoded to reduce the number of levels to avoid overfitting the model.
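As a quick sanity check on the overfitting concern, the events-per-variable (EPV) heuristic (roughly 10 events per model parameter) can be computed; a sketch, assuming the ~26 events mentioned in the backlog:

```r
# Events-per-variable heuristic: number of outcome events divided by the
# number of model degrees of freedom. Values well below ~10 suggest the
# model is at risk of overfitting.
epv <- function(n_events, n_params) n_events / n_params

epv(26, 5) # 26 events, 5 parameters: 5.2 events per parameter
```

Each extra factor level adds a parameter, which is why the recoding below collapses levels before the multivariable fits.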

df_uni <-
  patients_ %>%
  filter(is_eligible)

df_multi <-
  patients_ %>%
  filter(is_eligible)

Recode columns

For venous_drainage, recode “Both” as “Deep”.

df_multi <-
  df_multi %>%
  mutate(
    venous_drainage = ifelse(venous_drainage == "Both", "Deep", venous_drainage)
  )

For procedure_combinations, collapse combinations of more than one modality (codes longer than one character) into “Multimodal” to reduce the number of levels.

df_multi <-
  df_multi %>%
  mutate(procedure_combinations = case_when(
    nchar(procedure_combinations) > 1 ~ "Multimodal",
    .default = procedure_combinations
  ))

For procedure_combinations, indicate if surgery-based or not.

df_multi <-
  df_multi %>%
  mutate(
    is_surgery = str_detect(procedure_combinations, "S")
  ) %>%
  relocate(is_surgery, .after = procedure_combinations)

df_uni <-
  df_uni %>%
  mutate(
    is_surgery = str_detect(procedure_combinations, "S")
  ) %>%
  relocate(is_surgery, .after = procedure_combinations)

For location, reduce choices (use “Other”).

df_multi <-
  df_multi %>%
  mutate(
    location = ifelse(location == "Corpus Callosum", "Other", location),
    location = ifelse(location == "Cerebellum", "Other", location),
    # location = ifelse(location == "Deep", "Other", location),
    location = factor(location),
    location = relevel(location, ref = "Other")
  )

For spetzler_martin_grade, create a binary variant of 1-3 vs. 4-5.

# For multivariable analysis
df_multi <-
  df_multi %>%
  mutate(
    is_spetzler_martin_grade_less_than_4 = spetzler_martin_grade < 4
  ) %>%
  relocate(is_spetzler_martin_grade_less_than_4, .after = spetzler_martin_grade)

# For univariable analysis
df_uni <-
  df_uni %>%
  mutate(
    is_spetzler_martin_grade_less_than_4 = spetzler_martin_grade < 4
  ) %>%
  relocate(is_spetzler_martin_grade_less_than_4, .after = spetzler_martin_grade)

Missingness

# Get cols
cols <- unname(c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  EXPOSURES_CATEGORICAL,
  OUTCOMES
))

# Count missingness
df_multi %>%
  select(all_of(cols)) %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(), values_to = "num_missing") %>%
  arrange(desc(num_missing)) %>%
  filter(num_missing > 0) %>%
  sable()
name num_missing
lawton_young_grade 10
spetzler_martin_grade 8
is_spetzler_martin_grade_less_than_4 8
size_score 6
has_associated_aneurysm 6
max_size_cm 5
is_eloquent_location 5
has_deep_venous_drainage 4
modified_rankin_score_pretreatment 3
modified_rankin_score_postop_within_1_week 3
is_surgery 3
venous_drainage 3
procedure_combinations 3
modified_rankin_score_final_minus_pretx 3
has_complication_minor 2
has_complication_major 2
modified_rankin_score_final 1
is_male 1
is_diffuse_nidus 1
location 1
is_poor_outcome 1
modified_rankin_score_final_minus_presentation 1
change_in_mrs_tx_vs_final 1

Which eligible patients are missing each variable?

df_multi %>%
  filter(is_eligible) %>%
  select(mrn, all_of(cols)) %>%
  mutate(across(-mrn, is.na)) %>%
  pivot_longer(
    cols = -mrn,
    names_to = "name",
    values_to = "is_missing"
  ) %>%
  filter(is_missing) %>%
  group_by(name) %>%
  summarize(mrn = str_c(mrn, collapse = ", ")) %>%
  sable()
name mrn
change_in_mrs_tx_vs_final 60860434
has_associated_aneurysm 60233426, 60310463, 60303054, 61011961, 62696083, 62609300
has_complication_major 60860434, 62023254
has_complication_minor 60860434, 62023254
has_deep_venous_drainage 60233426, 62609300, 42446310, 81981102
is_diffuse_nidus 15871379
is_eloquent_location 62728597, 62696083, 62609300, 42446310, 81981102
is_male 19399469
is_poor_outcome 60860434
is_spetzler_martin_grade_less_than_4 60233426, 60303054, 60860434, 62728597, 62696083, 62609300, 42446310, 81981102
is_surgery 49659493, 46166047, 62609300
lawton_young_grade 60233426, 60310463, 60303054, 60860434, 61011961, 62728597, 62696083, 62609300, 42446310, 81981102
location 60310463
max_size_cm 60233426, 60303054, 60860434, 62609300, 81981102
modified_rankin_score_final 60860434
modified_rankin_score_final_minus_presentation 60860434
modified_rankin_score_final_minus_pretx 60860434, 49659493, 46166047
modified_rankin_score_postop_within_1_week 60860434, 49659493, 46166047
modified_rankin_score_pretreatment 60860434, 49659493, 46166047
procedure_combinations 49659493, 46166047, 62609300
size_score 60233426, 60303054, 60860434, 62609300, 42446310, 81981102
spetzler_martin_grade 60233426, 60303054, 60860434, 62728597, 62696083, 62609300, 42446310, 81981102
venous_drainage 60233426, 62609300, 81981102

Visualizations

Overall

Distribution of values of the exposures across levels of the outcome.

# Define the predictors of interest
cols <- c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  CHOSEN_OUTCOME
)

# Reshape the data
df_long <-
  df_uni %>%
  select(all_of(cols)) %>%
  pivot_longer(-all_of(CHOSEN_OUTCOME), names_to = "predictor", values_to = "value")

# Plot the box plots
df_long %>%
  ggplot(aes(
    x = .data[[CHOSEN_OUTCOME]],
    y = value,
    fill = .data[[CHOSEN_OUTCOME]],
    color = .data[[CHOSEN_OUTCOME]]
  )) +
  geom_boxplot(alpha = 0.5) +
  facet_wrap(~predictor, scales = "free") +
  labs(x = "Outcome", y = "Value", fill = "Outcome") +
  theme_minimal() +
  theme(
    axis.text = element_text(size = 10),
    axis.title = element_text(size = 11),
    strip.text = element_text(size = 11),
    legend.position = "none"
  )

By hemorrhage

# Define the predictors of interest
cols <- c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  CHOSEN_OUTCOME
)

# Reshape the data
df_long <-
  df_uni %>%
  select(all_of(cols)) %>%
  pivot_longer(
    cols = -c(all_of(CHOSEN_OUTCOME), `Hemorrhage at presentation`),
    names_to = "predictor",
    values_to = "value"
  )

# Plot the box plots
df_long %>%
  ggplot(aes(
    x = .data[[CHOSEN_OUTCOME]],
    y = value,
    fill = `Hemorrhage at presentation`,
    color = `Hemorrhage at presentation`
  )) +
  geom_boxplot(alpha = 0.5) +
  facet_wrap(~predictor, scales = "free") +
  labs(x = "Outcome", y = "Value", color = "Hemorrhage at presentation") +
  theme_minimal() +
  guides(fill = "none") +
  theme(
    axis.text = element_text(size = 10),
    axis.title = element_text(size = 11),
    strip.text = element_text(size = 11)
  )


Descriptive statistics

Setup

Descriptive statistics - continuous variables.

compute_descriptive_continuous <- function(
    df = df_uni,
    cols = c(EXPOSURES_CONTINUOUS),
    use_interaction = TRUE,
    interaction_col = "has_hemorrhage",
    interaction_col_name = "Hemorrhage at presentation",
    interaction_col_value = TRUE,
    prefix = "Haem") {
  # Apply desired formatting to numbers
  format_numbers <- function(x) {
    sprintf("%.1f", x)
  }

  # Compute new dataset
  df <-
    df %>%
    select(all_of(cols), all_of(CHOSEN_OUTCOME), all_of(interaction_col)) %>%
    drop_na(all_of(CHOSEN_OUTCOME), all_of(interaction_col))

  # Apply interaction
  if (use_interaction) {
    df <-
      df %>%
      filter(!!sym(interaction_col) == interaction_col_value)
  }

  # Compute descriptive statistics
  desc_stats <-
    df %>%
    pivot_longer(where(is.numeric), names_to = "keys", values_to = "values") %>%
    group_by(!!sym(CHOSEN_OUTCOME), keys) %>%
    summarize(
      num_total = n(),
      num_missing = sum(is.na(values)),
      mean = mean(values, na.rm = TRUE),
      sd = sd(values, na.rm = TRUE),
      min = quantile(values, 0, na.rm = TRUE),
      max = quantile(values, 1, na.rm = TRUE),
      q_50 = quantile(values, 0.50, na.rm = TRUE),
      q_25 = quantile(values, 0.25, na.rm = TRUE),
      q_75 = quantile(values, 0.75, na.rm = TRUE)
    ) %>%
    ungroup() %>%
    mutate(
      stats = glue::glue("{format_numbers(mean)} ({format_numbers(sd)})"),
      !!sym(CHOSEN_OUTCOME) := factor(
        !!sym(CHOSEN_OUTCOME),
        levels = CHOSEN_OUTCOME_LEVELS,
        labels = CHOSEN_OUTCOME_LABELS
      )
    ) %>%
    pivot_wider(
      id_cols = keys,
      names_from = !!sym(CHOSEN_OUTCOME),
      values_from = stats
    ) %>%
    relocate(CHOSEN_OUTCOME_LABELS[2], .after = CHOSEN_OUTCOME_LABELS[1])

  # Compute p-values
  pvals <-
    df %>%
    pivot_longer(where(is.numeric), names_to = "keys", values_to = "values") %>%
    group_by(keys) %>%
    summarize(
      pvalue = wilcox.test(values ~ !!sym(CHOSEN_OUTCOME))$p.value
    )

  # Synthesize
  out <-
    desc_stats %>%
    left_join(pvals)

  # Add sample size to column names
  num_with_outcome <- sum(df %>% pull(CHOSEN_OUTCOME))
  num_no_outcome <- sum(!df %>% pull(CHOSEN_OUTCOME))

  new_col_names <- c(
    glue::glue("{prefix} - {CHOSEN_OUTCOME_LABELS[2]} (n={num_with_outcome})"),
    glue::glue("{prefix} - {CHOSEN_OUTCOME_LABELS[1]} (n={num_no_outcome})"),
    glue::glue("{prefix} - P-value")
  )

  out <-
    out %>%
    rename_with(
      .cols = !!sym(CHOSEN_OUTCOME_LABELS[1]):pvalue, ~new_col_names
    ) %>%
    mutate(
      keys = ifelse(keys == interaction_col, interaction_col_name, keys)
    )

  return(out)
}

Descriptive statistics - binary variables.

compute_descriptive_binary <- function(
    df = df_uni,
    cols = c(EXPOSURES_BINARY),
    use_interaction = TRUE,
    interaction_col = "has_hemorrhage",
    interaction_col_name = "Hemorrhage at presentation",
    interaction_col_value = TRUE,
    prefix = "Haem") {
  # Define arguments
  cols <- cols[cols != interaction_col]

  # Apply desired formatting to numbers
  format_numbers <- function(x, decimals = 0) {
    decimals <- paste0("%.", decimals, "f")
    sprintf(decimals, x)
  }

  # Compute new dataset
  df <-
    df %>%
    select(all_of(cols), all_of(CHOSEN_OUTCOME), all_of(interaction_col)) %>%
    drop_na(all_of(CHOSEN_OUTCOME), all_of(interaction_col))

  # Apply interaction
  if (use_interaction) {
    df <-
      df %>%
      filter(!!sym(interaction_col) == interaction_col_value)
  }

  # Compute descriptive statistics
  desc_stats <-
    df %>%
    pivot_longer(
      cols = -all_of(CHOSEN_OUTCOME),
      names_to = "keys",
      values_to = "values"
    ) %>%
    group_by(!!sym(CHOSEN_OUTCOME), keys) %>%
    summarize(
      num_total = n(),
      num_missing = sum(is.na(values)),
      num_with = sum(values, na.rm = TRUE),
      num_without = sum(!values, na.rm = TRUE),
      pct_with = mean(values, na.rm = TRUE)
    ) %>%
    ungroup() %>%
    mutate(
      stats = glue::glue("{num_with} ({format_numbers(pct_with*100)}%)"),
      !!sym(CHOSEN_OUTCOME) := factor(
        !!sym(CHOSEN_OUTCOME),
        levels = CHOSEN_OUTCOME_LEVELS,
        labels = CHOSEN_OUTCOME_LABELS
      )
    ) %>%
    pivot_wider(
      id_cols = keys,
      names_from = !!sym(CHOSEN_OUTCOME),
      values_from = stats
    ) %>%
    relocate(CHOSEN_OUTCOME_LABELS[1], .after = CHOSEN_OUTCOME_LABELS[2])

  # Compute p-values
  pvals <-
    df %>%
    pivot_longer(
      cols = -all_of(CHOSEN_OUTCOME),
      names_to = "keys",
      values_to = "values"
    ) %>%
    group_by(keys) %>%
    summarize(
      pvalue = kruskal.test(values ~ !!sym(CHOSEN_OUTCOME))$p.value
    )

  # Synthesize
  out <-
    desc_stats %>%
    left_join(pvals)

  # Prettify
  num_with_outcome <- sum(df %>% pull(CHOSEN_OUTCOME))
  num_no_outcome <- sum(!df %>% pull(CHOSEN_OUTCOME))

  new_col_names <- c(
    glue::glue("{prefix} - {CHOSEN_OUTCOME_LABELS[2]} (n={num_with_outcome})"),
    glue::glue("{prefix} - {CHOSEN_OUTCOME_LABELS[1]} (n={num_no_outcome})"),
    glue::glue("{prefix} - P-value")
  )

  out <-
    out %>%
    rename_with(
      .cols = !!sym(CHOSEN_OUTCOME_LABELS[2]):pvalue, ~new_col_names
    ) %>%
    mutate(
      keys = ifelse(keys == interaction_col, interaction_col_name, keys)
    )

  return(out)
}

Non-specific

# Rename dataframe
`Pediatric AVMs` <-
  df_uni %>%
  filter(is_eligible) %>%
  select(-is_eligible, -comments)

# Create summary
`Pediatric AVMs` %>%
  summarytools::dfSummary(display.labels = FALSE) %>%
  print(
    file =
      "../../outputs/descriptive-statistics/descriptive_statistics_eligible.html",
    footnote = NA
  )

# Remove unwanted dataframe
remove(`Pediatric AVMs`)

By hemorrhage - table1 (WIP)

https://cran.r-project.org/web/packages/table1/vignettes/table1-examples.html

library(table1)

table1::table1(
  ~ factor(is_male) + age_at_first_treatment_yrs +
    factor(is_eloquent_location) + max_size_cm | has_hemorrhage,
  data = df_uni
)

# Initialize values
df <- df_uni

# Remove missing values in stratification variables
df <-
  df %>%
  drop_na(has_hemorrhage, all_of(CHOSEN_OUTCOME))

# Define columns
cols <- c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  OUTCOMES[OUTCOMES == CHOSEN_OUTCOME]
)

# Assign labels to the variables in the dataframe
for (var in cols) {
  label(df[[var]]) <- names(cols[cols == var])
}

# Assign meaningful labels to the has_hemorrhage variable
df$has_hemorrhage <- factor(
  df$has_hemorrhage,
  levels = c(FALSE, TRUE),
  labels = c("No Hemorrhage", "Has Hemorrhage")
)

df[, CHOSEN_OUTCOME] <- factor(
  df %>% pull(CHOSEN_OUTCOME),
  levels = CHOSEN_OUTCOME_LEVELS,
  labels = CHOSEN_OUTCOME_LABELS
)

# Define custom rendering functions if needed
render_continuous <- function(x) {
  with(stats.apply.rounding(stats.default(x), digits = 2), c(
    "Mean (SD)" = sprintf("%s (&plusmn; %s)", MEAN, SD),
    "Median (IQR)" = sprintf("%s (%s - %s)", MEDIAN, Q1, Q3)
  ))
}

compute_pvalues <- function(x, ...) {
  # Construct vectors of data y, and groups (strata) g
  y <- unlist(x)
  g <- factor(rep(seq_along(x), times = lengths(x)))
  if (is.numeric(y)) {
    # For numeric variables, perform a standard 2-sample t-test
    p <- t.test(y ~ g)$p.value
  } else {
    # For categorical variables, perform a chi-squared test of independence
    p <- chisq.test(table(y, g))$p.value
  }
  # Format the p-value, using an HTML entity for the less-than sign.
  # The initial empty string places the output on the line below the variable label.
  c("", sub("<", "&lt;", format.pval(p, digits = 3, eps = 0.001)))
}

# Create the descriptive table
frla <- as.formula(paste("~ . | has_hemorrhage *", CHOSEN_OUTCOME))

# Create table
table1_descriptive_statistics <-
  table1(frla,
    data = df[, unname(cols)],
    render.continuous = render_continuous,
    overall = c(left = "Overall"),
    topclass = "Rtable1-zebra"
  )

# Print
table1_descriptive_statistics

By hemorrhage

# Define arguments
interaction_col <- "has_hemorrhage"
interaction_col_name <- "Hemorrhage at presentation"
prefix_true <- "Haem"
prefix_false <- "No Haem"

# Get all
out <- list()
out[[1]] <- compute_descriptive_continuous(
  use_interaction = FALSE,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = TRUE,
  prefix = "All"
)
out[[2]] <- compute_descriptive_binary(
  use_interaction = FALSE,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = TRUE,
  prefix = "All"
)
out_all <- bind_rows(out)

# Get with hemorrhage
out <- list()
out[[1]] <- compute_descriptive_continuous(
  use_interaction = TRUE,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = TRUE,
  prefix = prefix_true
)
out[[2]] <- compute_descriptive_binary(
  use_interaction = TRUE,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = TRUE,
  prefix = prefix_true
)
out_haem <- bind_rows(out)

# Get without hemorrhage
out <- list()
out[[1]] <- compute_descriptive_continuous(
  use_interaction = TRUE,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = FALSE,
  prefix = prefix_false
)
out[[2]] <- compute_descriptive_binary(
  use_interaction = TRUE,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = FALSE,
  prefix = prefix_false
)
out_no_haem <- bind_rows(out)

# Synthesize
out_complete <-
  out_all %>%
  left_join(out_haem, by = "keys") %>%
  left_join(out_no_haem, by = "keys") %>%
  arrange(`All - P-value`)

# Print
out_complete %>%
  sable()
keys All - Obliterated (n=184) All - Not obliterated (n=52) All - P-value Haem - Obliterated (n=122) Haem - Not obliterated (n=22) Haem - P-value No Haem - Obliterated (n=62) No Haem - Not obliterated (n=30) No Haem - P-value
Had surgery 128 (71%) 10 (19%) 0.000 91 (76%) 3 (14%) 0.000 37 (60%) 7 (23%) 0.001
Lawton-Young grade 5.7 (1.5) 4.0 (1.4) 0.000 5.2 (1.4) 3.7 (1.3) 0.000 6.1 (1.4) 4.7 (1.4) 0.000
Spetzler-Martin grade 3.8 (1.1) 2.6 (1.2) 0.000 3.6 (1.0) 2.5 (1.1) 0.000 3.9 (1.1) 2.8 (1.2) 0.000
Nidus size (cm) 5.1 (3.0) 3.0 (1.8) 0.000 4.3 (2.9) 2.7 (1.8) 0.018 5.6 (3.0) 3.4 (1.8) 0.000
Spetzler-Martin grade < 4 138 (78%) 21 (42%) 0.000 95 (81%) 11 (55%) 0.010 43 (70%) 10 (33%) 0.001
Size score 2.1 (0.8) 1.5 (0.7) 0.000 1.8 (0.8) 1.5 (0.6) 0.069 2.4 (0.7) 1.7 (0.7) 0.000
Involves eloquent location 96 (54%) 46 (88%) 0.000 56 (48%) 22 (100%) 0.000 40 (65%) 24 (80%) 0.132
Diffuse nidus 15 (8%) 16 (31%) 0.000 10 (8%) 6 (27%) 0.009 5 (8%) 10 (33%) 0.003
mRS (final) 1.8 (1.8) 0.8 (1.0) 0.000 2.2 (1.9) 0.9 (1.0) 0.001 1.5 (1.7) 0.6 (0.9) 0.019
Hemorrhage at presentation 122 (66%) 22 (42%) 0.002 122 (100%) 22 (100%) 0 (0%) 0 (0%)
mRS (presentation) 1.8 (1.6) 2.5 (1.8) 0.013 2.8 (1.8) 3.2 (1.7) 0.272 1.1 (1.0) 1.1 (1.0) 0.975
Has deep venous drainage 91 (51%) 36 (69%) 0.017 68 (57%) 16 (73%) 0.173 23 (38%) 20 (67%) 0.010
Deficit at presentation 61 (33%) 26 (50%) 0.026 42 (34%) 11 (50%) 0.165 19 (31%) 15 (50%) 0.073
Presence of aneurysms 33 (18%) 13 (27%) 0.199 27 (22%) 7 (33%) 0.286 6 (10%) 6 (21%) 0.139
mRS (pre-treatment) 1.5 (1.4) 1.7 (1.5) 0.411 2.1 (1.6) 2.0 (1.6) 0.787 1.0 (0.9) 0.9 (0.9) 0.773
Paresis at presentation 45 (24%) 15 (29%) 0.522 33 (27%) 9 (41%) 0.190 12 (19%) 6 (20%) 0.942
Sex (Male) 96 (52%) 29 (57%) 0.553 66 (54%) 11 (50%) 0.724 30 (48%) 18 (62%) 0.226
Age at 1st treatment (years) 11.1 (3.7) 11.3 (3.9) 0.687 11.1 (3.1) 10.9 (3.6) 0.806 11.1 (4.1) 12.1 (4.3) 0.183
Seizures at presentation 45 (24%) 14 (27%) 0.717 25 (20%) 7 (32%) 0.241 20 (32%) 7 (23%) 0.381
mRS (1-week post-op) 1.7 (1.5) 1.7 (1.3) 0.946 2.3 (1.6) 1.9 (1.4) 0.351 1.4 (1.3) 1.2 (1.1) 0.688

By diffuse nidus

# Define arguments
interaction_col <- "is_diffuse_nidus"
interaction_col_name <- "Is diffuse nidus"
prefix_true <- "Diffuse"
prefix_false <- "Not Diffuse"

# Get all
out <- list()
out[[1]] <- compute_descriptive_continuous(
  use_interaction = FALSE,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = TRUE,
  prefix = "All"
)
out[[2]] <- compute_descriptive_binary(
  use_interaction = FALSE,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = TRUE,
  prefix = "All"
)
out_all <- bind_rows(out)

# Get diffuse
out <- list()
out[[1]] <- compute_descriptive_continuous(
  use_interaction = TRUE,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = TRUE,
  prefix = prefix_true
)
out[[2]] <- compute_descriptive_binary(
  use_interaction = TRUE,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = TRUE,
  prefix = prefix_true
)
out_diffuse <- bind_rows(out)

# Get not diffuse
out <- list()
out[[1]] <- compute_descriptive_continuous(
  use_interaction = TRUE,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = FALSE,
  prefix = prefix_false
)
out[[2]] <- compute_descriptive_binary(
  use_interaction = TRUE,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = FALSE,
  prefix = prefix_false
)
out_not_diffuse <- bind_rows(out)

# Synthesize
out_complete <-
  out_all %>%
  left_join(out_diffuse, by = "keys") %>%
  left_join(out_not_diffuse, by = "keys") %>%
  arrange(`All - P-value`)

# Print
out_complete %>%
  sable()
keys All - Obliterated (n=183) All - Not obliterated (n=52) All - P-value Diffuse - Obliterated (n=15) Diffuse - Not obliterated (n=16) Diffuse - P-value Not Diffuse - Obliterated (n=168) Not Diffuse - Not obliterated (n=36) Not Diffuse - P-value
Had surgery 128 (71%) 10 (19%) 0.000 11 (73%) 4 (25%) 0.008 117 (71%) 6 (17%) 0.000
Lawton-Young grade 5.7 (1.5) 4.0 (1.4) 0.000 7.1 (1.0) 6.3 (1.2) 0.091 5.1 (1.2) 3.8 (1.2) 0.000
Spetzler-Martin grade 3.8 (1.1) 2.6 (1.2) 0.000 4.6 (0.6) 4.1 (1.0) 0.134 3.4 (1.0) 2.5 (1.1) 0.000
Nidus size (cm) 5.1 (3.0) 3.0 (1.8) 0.000 7.4 (3.4) 5.3 (2.2) 0.091 4.0 (2.0) 2.8 (1.7) 0.000
Spetzler-Martin grade < 4 137 (77%) 21 (42%) 0.000 4 (27%) 1 (6%) 0.129 133 (82%) 20 (59%) 0.003
Size score 2.1 (0.8) 1.6 (0.7) 0.000 2.6 (0.6) 2.3 (0.7) 0.224 1.9 (0.8) 1.5 (0.6) 0.001
Involves eloquent location 95 (53%) 46 (88%) 0.000 13 (87%) 16 (100%) 0.137 82 (50%) 30 (83%) 0.000
Is diffuse nidus 15 (8%) 16 (31%) 0.000 15 (100%) 16 (100%) 0 (0%) 0 (0%)
mRS (final) 1.8 (1.8) 0.8 (1.0) 0.000 2.6 (1.8) 1.0 (0.9) 0.008 1.4 (1.7) 0.8 (1.0) 0.083
Hemorrhage at presentation 122 (67%) 22 (42%) 0.001 10 (67%) 6 (38%) 0.110 112 (67%) 16 (44%) 0.013
mRS (presentation) 1.8 (1.6) 2.5 (1.8) 0.013 2.0 (1.6) 2.0 (1.6) 0.983 1.7 (1.6) 2.5 (1.8) 0.013
Has deep venous drainage 91 (51%) 36 (69%) 0.019 13 (87%) 14 (88%) 0.946 78 (48%) 22 (61%) 0.142
Deficit at presentation 60 (33%) 26 (50%) 0.023 7 (47%) 11 (69%) 0.221 53 (32%) 15 (42%) 0.244
Presence of aneurysms 32 (18%) 13 (27%) 0.173 4 (27%) 6 (38%) 0.526 28 (17%) 7 (21%) 0.561
mRS (pre-treatment) 1.5 (1.4) 1.7 (1.5) 0.426 1.9 (1.6) 1.7 (1.3) 0.967 1.3 (1.3) 1.7 (1.5) 0.228
Paresis at presentation 44 (24%) 15 (29%) 0.482 5 (33%) 7 (44%) 0.558 39 (23%) 8 (22%) 0.898
Sex (Male) 96 (52%) 29 (57%) 0.578 6 (40%) 10 (62%) 0.218 90 (54%) 19 (54%) 0.939
Age at 1st treatment (years) 11.1 (3.7) 11.3 (3.9) 0.649 10.2 (3.6) 10.9 (3.9) 0.632 11.5 (3.7) 11.4 (3.9) 0.816
Seizures at presentation 45 (25%) 14 (27%) 0.733 6 (40%) 5 (31%) 0.617 39 (23%) 9 (25%) 0.819
mRS (1-week post-op) 1.7 (1.5) 1.7 (1.3) 0.931 2.2 (1.6) 1.6 (1.1) 0.312 1.5 (1.4) 1.7 (1.3) 0.441

Associations

Correlation matrix.

# Create new dataframe
df_ <-
  df_uni %>%
  select(all_of(c(
    EXPOSURES_CONTINUOUS,
    EXPOSURES_BINARY,
    EXPOSURES_CATEGORICAL,
    OUTCOMES
  )))

# Convert the outcome variable to numeric for correlation calculation
df_ <-
  df_ %>%
  select(where(is.numeric), where(is.logical)) %>%
  mutate(across(everything(), as.numeric))

# Compute the correlation matrix
cor_matrix <- cor(df_, use = "complete.obs", method = "spearman")

# Plot the correlation matrix
ggcorrplot::ggcorrplot(
  cor_matrix,
  method = "circle",
  lab = TRUE,
  lab_size = 2,
  colors = c("red", "white", "green4"),
  title = "Correlation Matrix",
  hc.order = TRUE
) +
  theme(
    axis.text.x = element_text(size = 8), # Adjust x-axis text size
    axis.text.y = element_text(size = 8) # Adjust y-axis text size
  )


Univariable statistics

Setup

Define function.

fit_model <- function(
    df = df_uni, cols, is_sandwich = TRUE) {
  # Initialize
  out <- list()

  for (i in seq_along(cols)) {
    # Create formula
    model <- as.formula(paste(CHOSEN_OUTCOME, "~", cols[i]))

    fit <- suppressWarnings(glm(
      model,
      data = df,
      family = binomial()
    ))

    if (is_sandwich) {
      # Calculate robust standard errors (sandwich)
      robust_se <- sandwich::vcovHC(fit, type = "HC0")
      # Compute robust confidence intervals
      fit_results <- lmtest::coeftest(fit, vcov. = robust_se)
    } else {
      fit_results <- fit
    }

    # Summarize model coefficients
    fit_summary <-
      fit_results %>%
      # broom does not support exponentiation after coeftest so do it manually
      broom::tidy(conf.int = T) %>%
      mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
      arrange(term == "(Intercept)", p.value) %>%
      rename(odds_ratio = estimate, z_value = statistic)

    # Stylize and print
    stylized <-
      fit_summary %>%
      rename(
        "Predictors" = term,
        "Odds Ratios (OR)" = odds_ratio,
        "SE" = std.error,
        "Z-scores" = z_value,
        "P-values" = p.value,
        "CI (low)" = conf.low,
        "CI (high)" = conf.high
      ) %>%
      mutate(
        `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
        `SE` = round(`SE`, 2),
        `CI (low)` = round(`CI (low)`, 2),
        `CI (high)` = round(`CI (high)`, 2),
        `P-values` = round(`P-values`, 3)
      )

    out[[i]] <- stylized
  }
  return(out)
}
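For intuition on the is_sandwich option, here is a standalone comparison of model-based and HC0 robust standard errors on simulated data (illustrative only; under correct model specification the two are typically close):

```r
# Compare classical (model-based) and HC0 sandwich standard errors
# for the slope of a simple logistic regression.
library(sandwich)

set.seed(1891)
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 * x))
fit <- glm(y ~ x, family = binomial())

se_model <- sqrt(diag(vcov(fit)))["x"]
se_robust <- sqrt(diag(sandwich::vcovHC(fit, type = "HC0")))["x"]
c(model = se_model, robust = se_robust)
```

Large discrepancies between the two would suggest model misspecification, which is when the robust errors earn their keep.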

Unadjusted - Original

Fit logistic.

# Define cols
cols <- unname(c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  EXPOSURES_CATEGORICAL
))

# Fit model
out <- fit_model(df = df_uni, cols = cols, is_sandwich = TRUE)

# Create table
univariable_unadjusted <-
  out %>%
  bind_rows() %>%
  filter(Predictors != "(Intercept)") %>%
  filter(`CI (high)` < 50) %>%
  arrange(`P-values`)

# Print
univariable_unadjusted %>%
  sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
modified_rankin_score_final 0.57 0.12 -4.662 0.000 0.45 0.72
max_size_cm 0.68 0.08 -4.683 0.000 0.58 0.80
spetzler_martin_grade 0.40 0.16 -5.745 0.000 0.29 0.55
lawton_young_grade 0.47 0.12 -6.187 0.000 0.37 0.60
size_score 0.33 0.23 -4.680 0.000 0.21 0.53
is_eloquent_locationTRUE 0.15 0.46 -4.119 0.000 0.06 0.37
is_diffuse_nidusTRUE 0.20 0.40 -3.977 0.000 0.09 0.44
is_spetzler_martin_grade_less_than_4TRUE 4.76 0.34 4.617 0.000 2.46 9.24
is_surgeryTRUE 10.14 0.39 5.972 0.000 4.74 21.70
venous_drainageSuperficial 4.45 0.42 3.595 0.000 1.97 10.05
has_hemorrhageTRUE 2.68 0.32 3.074 0.002 1.43 5.03
venous_drainageDeep 2.91 0.41 2.597 0.009 1.30 6.53
locationDeep 0.07 1.05 -2.540 0.011 0.01 0.54
modified_rankin_score_presentation 1.26 0.10 2.388 0.017 1.04 1.52
has_deep_venous_drainageTRUE 0.45 0.34 -2.351 0.019 0.24 0.88
has_deficitTRUE 0.50 0.32 -2.202 0.028 0.27 0.93
locationOther 0.10 1.60 -1.473 0.141 0.00 2.17
locationHemispheric 0.24 1.05 -1.341 0.180 0.03 1.91
procedure_combinationsERS 3.75 1.00 1.316 0.188 0.52 26.84
has_associated_aneurysmTRUE 0.62 0.38 -1.281 0.200 0.30 1.29
modified_rankin_score_pretreatment 1.10 0.11 0.844 0.399 0.88 1.38
procedure_combinationsR 2.05 0.96 0.748 0.454 0.31 13.51
procedure_combinationsER 1.87 0.96 0.654 0.513 0.29 12.33
has_paresisTRUE 0.80 0.35 -0.641 0.521 0.40 1.59
is_maleTRUE 0.83 0.32 -0.593 0.553 0.44 1.55
has_seizuresTRUE 0.88 0.36 -0.363 0.717 0.44 1.77
modified_rankin_score_postop_within_1_week 0.96 0.12 -0.355 0.722 0.75 1.22
age_at_first_treatment_yrs 1.01 0.04 0.307 0.759 0.94 1.09

Adjusted - Original

Fit logistic.

cols <- unname(c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  EXPOSURES_CATEGORICAL
))

# Initialize
out <- list()
df <- df_uni

for (i in seq_along(cols)) {
  # Create formula
  model <- as.formula(paste(
    CHOSEN_OUTCOME,
    "~",
    "age_at_first_treatment_yrs + is_male +",
    cols[i]
  ))

  fit <- suppressWarnings(glm(
    model,
    data = df,
    family = binomial()
  ))

  # Calculate robust standard errors (sandwich)
  robust_se <- sandwich::vcovHC(fit, type = "HC0")
  # Compute robust confidence intervals
  robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)

  # Summarize model coefficients
  fit_summary <-
    robust_ci %>%
    # broom does not support exponentiation after coeftest so do it manually
    broom::tidy(conf.int = T) %>%
    mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
    arrange(term == "(Intercept)", p.value) %>%
    rename(odds_ratio = estimate, z_value = statistic)

  # Stylize and print
  stylized <-
    fit_summary %>%
    rename(
      "Predictors" = term,
      "Odds Ratios (OR)" = odds_ratio,
      "SE" = std.error,
      "Z-scores" = z_value,
      "P-values" = p.value,
      "CI (low)" = conf.low,
      "CI (high)" = conf.high
    ) %>%
    mutate(
      `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
      `SE` = round(`SE`, 2),
      `CI (low)` = round(`CI (low)`, 2),
      `CI (high)` = round(`CI (high)`, 2),
      `P-values` = round(`P-values`, 3)
    )

  out[[i]] <- stylized
}

Print results.

# Create table
univariable_adjusted <-
  out %>%
  bind_rows() %>%
  filter(Predictors != "(Intercept)") %>%
  filter(!str_starts(Predictors, "age_at_first_treatment_yrs|is_male")) %>%
  filter(`CI (high)` < 50) %>%
  arrange(`P-values`)

# Print
univariable_adjusted %>%
  sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
modified_rankin_score_final 0.55 0.13 -4.656 0.000 0.43 0.71
max_size_cm 0.67 0.08 -4.823 0.000 0.57 0.79
spetzler_martin_grade 0.38 0.16 -5.899 0.000 0.28 0.53
lawton_young_grade 0.47 0.12 -6.212 0.000 0.37 0.59
size_score 0.32 0.24 -4.811 0.000 0.20 0.51
is_eloquent_locationTRUE 0.15 0.47 -4.031 0.000 0.06 0.38
is_diffuse_nidusTRUE 0.19 0.40 -4.063 0.000 0.09 0.43
is_spetzler_martin_grade_less_than_4TRUE 5.02 0.34 4.740 0.000 2.58 9.78
is_surgeryTRUE 10.16 0.40 5.824 0.000 4.65 22.16
venous_drainageSuperficial 4.90 0.43 3.735 0.000 2.13 11.27
has_hemorrhageTRUE 2.66 0.33 2.981 0.003 1.40 5.08
venous_drainageDeep 3.01 0.41 2.674 0.008 1.34 6.76
locationDeep 0.06 1.05 -2.603 0.009 0.01 0.51
has_deep_venous_drainageTRUE 0.42 0.35 -2.480 0.013 0.21 0.84
modified_rankin_score_presentation 1.26 0.10 2.376 0.017 1.04 1.52
has_deficitTRUE 0.48 0.32 -2.343 0.019 0.26 0.89
has_associated_aneurysmTRUE 0.58 0.38 -1.409 0.159 0.28 1.23
locationHemispheric 0.23 1.05 -1.405 0.160 0.03 1.79
procedure_combinationsERS 3.44 1.07 1.159 0.247 0.43 27.84
modified_rankin_score_pretreatment 1.09 0.11 0.731 0.465 0.87 1.36
procedure_combinationsR 2.09 1.03 0.718 0.473 0.28 15.76
has_paresisTRUE 0.78 0.35 -0.691 0.490 0.39 1.56
modified_rankin_score_postop_within_1_week 0.94 0.12 -0.510 0.610 0.74 1.19
procedure_combinationsER 1.63 1.03 0.475 0.635 0.22 12.23
has_seizuresTRUE 0.87 0.36 -0.407 0.684 0.43 1.74

Unadjusted - Recoded

Fit logistic.

# Define columns
cols <- unname(c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  EXPOSURES_CATEGORICAL
))

# Initialize
out <- list()
df <- df_multi

for (i in seq_along(cols)) {
  # Create formula
  model <- as.formula(paste(CHOSEN_OUTCOME, "~", cols[i]))

  fit <- suppressWarnings(glm(
    model,
    data = df,
    family = binomial()
  ))

  # Calculate robust standard errors (sandwich)
  robust_se <- sandwich::vcovHC(fit, type = "HC0")
  # Compute robust confidence intervals
  robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)

  # Summarize model coefficients
  fit_summary <-
    robust_ci %>%
    # broom does not support exponentiation after coeftest so do it manually
    broom::tidy(conf.int = T) %>%
    mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
    arrange(term == "(Intercept)", p.value) %>%
    rename(odds_ratio = estimate, z_value = statistic)

  # Stylize and print
  stylized <-
    fit_summary %>%
    rename(
      "Predictors" = term,
      "Odds Ratios (OR)" = odds_ratio,
      "SE" = std.error,
      "Z-scores" = z_value,
      "P-values" = p.value,
      "CI (low)" = conf.low,
      "CI (high)" = conf.high,
    ) %>%
    mutate(
      `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
      `SE` = round(`SE`, 2),
      `CI (low)` = round(`CI (low)`, 2),
      `CI (high)` = round(`CI (high)`, 2),
      `P-values` = round(`P-values`, 3),
    )

  out[[i]] <- stylized
}

Print results.

# Create table
univariable_unadjusted_recoded <-
  out %>%
  bind_rows() %>%
  filter(Predictors != "(Intercept)") %>%
  filter(`CI (high)` < 50) %>%
  arrange(`P-values`)

# Print
univariable_unadjusted_recoded %>%
  sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
modified_rankin_score_final 0.57 0.12 -4.662 0.000 0.45 0.72
max_size_cm 0.68 0.08 -4.683 0.000 0.58 0.80
spetzler_martin_grade 0.40 0.16 -5.745 0.000 0.29 0.55
lawton_young_grade 0.47 0.12 -6.187 0.000 0.37 0.60
size_score 0.33 0.23 -4.680 0.000 0.21 0.53
is_eloquent_locationTRUE 0.15 0.46 -4.119 0.000 0.06 0.37
is_diffuse_nidusTRUE 0.20 0.40 -3.977 0.000 0.09 0.44
is_spetzler_martin_grade_less_than_4TRUE 4.76 0.34 4.617 0.000 2.46 9.24
has_hemorrhageTRUE 2.68 0.32 3.074 0.002 1.43 5.03
locationDeep 0.09 0.77 -3.120 0.002 0.02 0.41
venous_drainageSuperficial 2.28 0.34 2.452 0.014 1.18 4.39
modified_rankin_score_presentation 1.26 0.10 2.388 0.017 1.04 1.52
has_deep_venous_drainageTRUE 0.45 0.34 -2.351 0.019 0.24 0.88
has_deficitTRUE 0.50 0.32 -2.202 0.028 0.27 0.93
procedure_combinationsMultimodal 4.90 0.94 1.697 0.090 0.78 30.71
locationHemispheric 0.32 0.77 -1.485 0.138 0.07 1.44
has_associated_aneurysmTRUE 0.62 0.38 -1.281 0.200 0.30 1.29
modified_rankin_score_pretreatment 1.10 0.11 0.844 0.399 0.88 1.38
procedure_combinationsR 2.05 0.96 0.748 0.454 0.31 13.51
has_paresisTRUE 0.80 0.35 -0.641 0.521 0.40 1.59
is_maleTRUE 0.83 0.32 -0.593 0.553 0.44 1.55
has_seizuresTRUE 0.88 0.36 -0.363 0.717 0.44 1.77
modified_rankin_score_postop_within_1_week 0.96 0.12 -0.355 0.722 0.75 1.22
age_at_first_treatment_yrs 1.01 0.04 0.307 0.759 0.94 1.09

Adjusted - Recoded

Fit logistic.

# Define columns
cols <- unname(c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  EXPOSURES_CATEGORICAL
))

# Initialize
out <- list()
df <- df_multi

for (i in seq_along(cols)) {
  # Create formula
  model <- as.formula(paste(
    CHOSEN_OUTCOME,
    "~",
    "age_at_first_treatment_yrs + is_male +",
    cols[i]
  ))

  fit <- suppressWarnings(glm(
    model,
    data = df,
    family = binomial()
  ))

  # Calculate robust standard errors (sandwich)
  robust_se <- sandwich::vcovHC(fit, type = "HC0")
  # Compute robust confidence intervals
  robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)

  # Summarize model coefficients
  fit_summary <-
    robust_ci %>%
    # broom does not support exponentiation after coeftest so do it manually
    broom::tidy(conf.int = T) %>%
    mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
    arrange(term == "(Intercept)", p.value) %>%
    rename(odds_ratio = estimate, z_value = statistic)

  # Stylize and print
  stylized <-
    fit_summary %>%
    rename(
      "Predictors" = term,
      "Odds Ratios (OR)" = odds_ratio,
      "SE" = std.error,
      "Z-scores" = z_value,
      "P-values" = p.value,
      "CI (low)" = conf.low,
      "CI (high)" = conf.high,
    ) %>%
    mutate(
      `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
      `SE` = round(`SE`, 2),
      `CI (low)` = round(`CI (low)`, 2),
      `CI (high)` = round(`CI (high)`, 2),
      `P-values` = round(`P-values`, 3),
    )

  out[[i]] <- stylized
}

Print results.

# Create table
univariable_adjusted_recoded <-
  out %>%
  bind_rows() %>%
  filter(Predictors != "(Intercept)") %>%
  filter(!str_starts(Predictors, "age_at_first_treatment_yrs|is_male")) %>%
  filter(`CI (high)` < 50) %>%
  arrange(`P-values`)

# Print
univariable_adjusted_recoded %>%
  sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
modified_rankin_score_final 0.55 0.13 -4.656 0.000 0.43 0.71
max_size_cm 0.67 0.08 -4.823 0.000 0.57 0.79
spetzler_martin_grade 0.38 0.16 -5.899 0.000 0.28 0.53
lawton_young_grade 0.47 0.12 -6.212 0.000 0.37 0.59
size_score 0.32 0.24 -4.811 0.000 0.20 0.51
is_eloquent_locationTRUE 0.15 0.47 -4.031 0.000 0.06 0.38
is_diffuse_nidusTRUE 0.19 0.40 -4.063 0.000 0.09 0.43
is_spetzler_martin_grade_less_than_4TRUE 5.02 0.34 4.740 0.000 2.58 9.78
has_hemorrhageTRUE 2.66 0.33 2.981 0.003 1.40 5.08
locationDeep 0.04 1.04 -3.022 0.003 0.01 0.33
venous_drainageSuperficial 2.45 0.35 2.582 0.010 1.24 4.82
has_deep_venous_drainageTRUE 0.42 0.35 -2.480 0.013 0.21 0.84
modified_rankin_score_presentation 1.26 0.10 2.376 0.017 1.04 1.52
has_deficitTRUE 0.48 0.32 -2.343 0.019 0.26 0.89
locationHemispheric 0.15 1.04 -1.815 0.069 0.02 1.16
procedure_combinationsMultimodal 4.53 0.98 1.548 0.122 0.67 30.72
has_associated_aneurysmTRUE 0.58 0.38 -1.409 0.159 0.28 1.23
procedure_combinationsR 2.09 1.00 0.735 0.462 0.29 14.87
modified_rankin_score_pretreatment 1.09 0.11 0.731 0.465 0.87 1.36
has_paresisTRUE 0.78 0.35 -0.691 0.490 0.39 1.56
modified_rankin_score_postop_within_1_week 0.94 0.12 -0.510 0.610 0.74 1.19
has_seizuresTRUE 0.87 0.36 -0.407 0.684 0.43 1.74

Interaction analysis

Adjusted - Recoded

Fit logistic.

# Define columns
cols <- unname(c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  EXPOSURES_CATEGORICAL
))

# Remove unwanted columns
unwanted <- c("modified_rankin_score_final")
cols <- cols[!cols %in% unwanted]

adjustors <- c(
  "age_at_first_treatment_yrs",
  "is_male"
)

# Initialize
out <- list()
df <- df_multi
k <- 1

for (i in seq_along(cols)) {
  # Do not fit model for variables that we are adjusting for
  if (cols[i] %in% adjustors) {
    next
  }

  # Create formula
  model <- as.formula(paste(
    CHOSEN_OUTCOME,
    "~",
    "age_at_first_treatment_yrs + is_male +",
    cols[i],
    "* has_hemorrhage"
  ))

  fit <- suppressWarnings(glm(
    model,
    data = df,
    family = binomial()
  ))

  # Calculate robust standard errors (sandwich)
  robust_se <- sandwich::vcovHC(fit, type = "HC0")
  # Compute robust confidence intervals
  robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)

  # Summarize model coefficients
  fit_summary <-
    robust_ci %>%
    # broom does not support exponentiation after coeftest so do it manually
    broom::tidy(conf.int = T) %>%
    mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
    arrange(term == "(Intercept)", p.value) %>%
    rename(odds_ratio = estimate, z_value = statistic)

  # Stylize and print
  stylized <-
    fit_summary %>%
    rename(
      "Predictors" = term,
      "Odds Ratios (OR)" = odds_ratio,
      "SE" = std.error,
      "Z-scores" = z_value,
      "P-values" = p.value,
      "CI (low)" = conf.low,
      "CI (high)" = conf.high,
    ) %>%
    mutate(
      `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
      `SE` = round(`SE`, 2),
      `CI (low)` = round(`CI (low)`, 2),
      `CI (high)` = round(`CI (high)`, 2),
      `P-values` = round(`P-values`, 3),
    )

  out[[k]] <- stylized
  k <- k + 1
}

Print all results.

# Define values
unwanted <- c(
  "is_maleTRUE",
  "age_at_first_treatment_yrs",
  "(Intercept)",
  "has_hemorrhageTRUE"
)

# Initialize objects
isolated_values <- list()

# Get main and interaction effects
for (i in seq_along(out)) {
  # Get data
  result <- out[[i]]

  # Get all main effects of interest
  main_effects <-
    result %>%
    filter(!Predictors %in% unwanted) %>%
    filter(!str_detect(Predictors, ":")) %>%
    select(-"SE", -"Z-scores")

  # Get the interaction effects
  interaction_effects <-
    result %>%
    filter(!Predictors %in% unwanted) %>%
    filter(str_detect(Predictors, ":")) %>%
    mutate(Predictors = str_extract(Predictors, "^[^:]+")) %>%
    select(-"SE", -"Z-scores") %>%
    rename_with(~ paste("Interaction -", .), -Predictors)

  isolated_values[[i]] <-
    main_effects %>%
    left_join(interaction_effects, by = "Predictors")
}

Create and print the table.

# Create table
univariable_adjusted_recoded_interactions <-
  isolated_values %>%
  bind_rows() %>%
  arrange(`Interaction - P-values`)

# Print
univariable_adjusted_recoded_interactions %>%
  sable()
Predictors Odds Ratios (OR) P-values CI (low) CI (high) Interaction - Odds Ratios (OR) Interaction - P-values Interaction - CI (low) Interaction - CI (high)
is_eloquent_locationTRUE 5.00e-01 0.190 1.70e-01 1.42e+00 0.00 0.000 0.00 0.00
locationDeep 2.20e-01 0.195 2.00e-02 2.15e+00 0.00 0.000 0.00 0.00
locationHemispheric 5.80e-01 0.629 6.00e-02 5.40e+00 0.00 0.000 0.00 0.00
has_seizuresTRUE 1.49e+00 0.431 5.50e-01 4.03e+00 0.38 0.183 0.09 1.58
size_score 2.50e-01 0.000 1.30e-01 5.10e-01 1.87 0.226 0.68 5.16
is_surgeryTRUE 6.51e+07 0.000 2.79e+07 1.52e+08 0.58 0.296 0.21 1.60
max_size_cm 6.40e-01 0.000 5.00e-01 8.20e-01 1.16 0.370 0.84 1.62
venous_drainageSuperficial 3.82e+00 0.006 1.48e+00 9.87e+00 0.56 0.411 0.14 2.23
has_paresisTRUE 1.00e+00 0.994 3.10e-01 3.20e+00 0.55 0.448 0.11 2.61
has_deep_venous_drainageTRUE 2.80e-01 0.009 1.10e-01 7.30e-01 1.69 0.453 0.43 6.72
modified_rankin_score_presentation 9.80e-01 0.929 6.40e-01 1.50e+00 1.18 0.519 0.71 1.95
is_spetzler_martin_grade_less_than_4TRUE 5.17e+00 0.001 1.97e+00 1.36e+01 0.70 0.626 0.17 2.93
has_associated_aneurysmTRUE 3.90e-01 0.134 1.10e-01 1.34e+00 1.40 0.679 0.28 6.99
has_deficitTRUE 4.10e-01 0.069 1.50e-01 1.07e+00 1.31 0.713 0.31 5.50
is_diffuse_nidusTRUE 1.70e-01 0.004 5.00e-02 5.70e-01 1.36 0.718 0.26 7.16
modified_rankin_score_pretreatment 9.00e-01 0.656 5.60e-01 1.45e+00 1.07 0.809 0.61 1.87
lawton_young_grade 4.60e-01 0.000 3.10e-01 6.90e-01 1.05 0.855 0.62 1.78
modified_rankin_score_postop_within_1_week 8.70e-01 0.438 6.00e-01 1.25e+00 0.96 0.869 0.58 1.58
spetzler_martin_grade 4.10e-01 0.000 2.50e-01 6.50e-01 1.00 0.994 0.53 1.91
procedure_combinationsMultimodal 2.43e+08 0.00
procedure_combinationsR 1.23e+08 0.00
procedure_combinationsS 1.22e+16 0.00

Eloquence. Note the complete separation below: every non-eloquent patient with hemorrhage was obliterated (61/61), which explains the degenerate interaction estimates (0.00) for eloquence and location above.

df %>%
  count(has_hemorrhage, is_eloquent_location, !!sym(CHOSEN_OUTCOME)) %>%
  drop_na() %>%
  arrange(has_hemorrhage, is_eloquent_location) %>%
  group_by(has_hemorrhage, is_eloquent_location) %>%
  mutate(
    num_total = sum(n),
    pct_total = scales::percent(n / num_total, 1)
  ) %>%
  sable()
has_hemorrhage is_eloquent_location is_obliterated n num_total pct_total
FALSE FALSE FALSE 6 28 21%
FALSE FALSE TRUE 22 28 79%
FALSE TRUE FALSE 24 64 38%
FALSE TRUE TRUE 40 64 62%
TRUE FALSE TRUE 61 61 100%
TRUE TRUE FALSE 22 78 28%
TRUE TRUE TRUE 56 78 72%

Deficit.

df %>%
  count(has_hemorrhage, has_deficit, !!sym(CHOSEN_OUTCOME)) %>%
  drop_na() %>%
  arrange(has_hemorrhage, has_deficit) %>%
  group_by(has_hemorrhage, has_deficit) %>%
  mutate(
    num_total = sum(n),
    pct_total = scales::percent(n / num_total, 1)
  ) %>%
  sable()
has_hemorrhage has_deficit is_obliterated n num_total pct_total
FALSE FALSE FALSE 15 58 26%
FALSE FALSE TRUE 43 58 74%
FALSE TRUE FALSE 15 34 44%
FALSE TRUE TRUE 19 34 56%
TRUE FALSE FALSE 11 91 12%
TRUE FALSE TRUE 80 91 88%
TRUE TRUE FALSE 11 53 21%
TRUE TRUE TRUE 42 53 79%

Seizures.

df %>%
  count(has_hemorrhage, has_seizures, !!sym(CHOSEN_OUTCOME)) %>%
  drop_na() %>%
  arrange(has_hemorrhage, has_seizures) %>%
  group_by(has_hemorrhage, has_seizures) %>%
  mutate(
    num_total = sum(n),
    pct_total = scales::percent(n / num_total, 1)
  ) %>%
  sable()
has_hemorrhage has_seizures is_obliterated n num_total pct_total
FALSE FALSE FALSE 23 65 35%
FALSE FALSE TRUE 42 65 65%
FALSE TRUE FALSE 7 27 26%
FALSE TRUE TRUE 20 27 74%
TRUE FALSE FALSE 15 112 13%
TRUE FALSE TRUE 97 112 87%
TRUE TRUE FALSE 7 32 22%
TRUE TRUE TRUE 25 32 78%

Selective inference

Read the following guides:

- How to use glmnet
- Selective inference using forward selection

Setup

Define unwanted columns.

unwanted_all <- c(
  # Use values at or before first treatment
  "modified_rankin_score_postop_within_1_week",
  "modified_rankin_score_final"
)

unwanted_without_gradings <- c(
  unwanted_all,
  # Spetzler-Martin grade (size score + eloquent location + venous drainage)
  "spetzler_martin_grade",
  "is_spetzler_martin_grade_less_than_4",
  "size_score", # Already covered by the more detailed max_size_cm
  "venous_drainage", # Already covered by has_deep_venous_drainage
  "location", # Already covered to some extent by eloquence
  # Lawton-Young grade (SM + age + diffuse nidus + hemorrhage)
  # (SM is heavily overlapping with LY, so include its components)
  "lawton_young_grade",
  # Use mRS pre-treatment (more predictive)
  "modified_rankin_score_presentation" # "modified_rankin_score_pretreatment"
)

# Define cols of interest - use grading scores whenever these exist
unwanted_with_gradings <- c(
  # Spetzler-Martin grade (size score + eloquent location + venous drainage)
  "is_spetzler_martin_grade_less_than_4", # The grade has lower p-value
  "size_score",
  "max_size_cm",
  "is_eloquent_location",
  # "location",  # Include this as not highly correlated with SM
  "has_deep_venous_drainage",
  "venous_drainage",
  # Lawton-Young grade (SM + age + diffuse nidus + hemorrhage)
  # (SM is heavily overlapping with LY, so include its components)
  "lawton_young_grade",
  # "is_diffuse_nidus",
  # "has_hemorrhage",
  # Use values at or before first treatment
  "modified_rankin_score_final",
  "modified_rankin_score_postop_within_1_week",
  # Use mRS pre-treatment (more predictive)
  "modified_rankin_score_presentation"
  # "modified_rankin_score_pretreatment"
)
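The collinearity the comments warn about is structural: the Spetzler-Martin grade is, by definition, the sum of its three components, so keeping both the grade and its components duplicates information. A toy sketch (column names mirror this dataset's conventions; values are made up):

```r
# Spetzler-Martin grade = size score (1-3) + eloquence (0/1)
# + deep venous drainage (0/1); the grade and its components are
# therefore perfectly collinear. Toy values only.
toy <- data.frame(
  size_score = c(1, 2, 3),
  is_eloquent_location = c(FALSE, TRUE, TRUE),
  has_deep_venous_drainage = c(FALSE, FALSE, TRUE)
)
toy$spetzler_martin_grade <-
  toy$size_score + toy$is_eloquent_location + toy$has_deep_venous_drainage
toy$spetzler_martin_grade
# -> 1 3 5
```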

Create dataset.

# Define cols
cols <- unname(c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  EXPOSURES_CATEGORICAL,
  CHOSEN_OUTCOME
))

# Define column sets for each analysis
cols_all <- cols[!cols %in% unwanted_all]
cols_with_gradings <- cols[!cols %in% unwanted_with_gradings]
cols_without_gradings <- cols[!cols %in% unwanted_without_gradings]

# Create df of interest
df_all <-
  # Use the non-recoded dataset
  df_uni %>%
  dplyr::select(all_of(cols_all)) %>%
  drop_na()

df_with_gradings <-
  # Use the non-recoded dataset
  df_uni %>%
  dplyr::select(all_of(cols_with_gradings)) %>%
  drop_na()

df_no_scores <-
  # Use the non-recoded dataset
  df_uni %>%
  dplyr::select(all_of(cols_without_gradings)) %>%
  drop_na()

# Create formula
frla <- as.formula(paste(CHOSEN_OUTCOME, " ~ . - 1"))

# Create matrices with all variables
X_all <- model.matrix(frla, df_all)
y_all <- df_all %>%
  pull(CHOSEN_OUTCOME) %>%
  as.numeric()

# Create matrices with scores
X_with_gradings <- model.matrix(frla, df_with_gradings)
y_with_gradings <- df_with_gradings %>%
  pull(CHOSEN_OUTCOME) %>%
  as.numeric()

# Create matrices with score components
X_without_gradings <- model.matrix(frla, df_no_scores)
y_without_gradings <- df_no_scores %>%
  pull(CHOSEN_OUTCOME) %>%
  as.numeric()

# X should be centered for LASSO (necessary for glmnet)
# (scale = FALSE as this is done by the standardization step in glmnet)
X_all_scaled <- scale(X_all, center = T, scale = F)
X_with_gradings_scaled <- scale(X_with_gradings, center = T, scale = F)
X_without_gradings_scaled <- scale(X_without_gradings, center = T, scale = F)

# See the names of X to match column numbers to column names later on
colnames(X_without_gradings_scaled)
 [1] "age_at_first_treatment_yrs"         "modified_rankin_score_pretreatment"
 [3] "max_size_cm"                        "is_maleFALSE"                      
 [5] "is_maleTRUE"                        "is_eloquent_locationTRUE"          
 [7] "has_deep_venous_drainageTRUE"       "is_diffuse_nidusTRUE"              
 [9] "has_hemorrhageTRUE"                 "has_seizuresTRUE"                  
[11] "has_deficitTRUE"                    "has_paresisTRUE"                   
[13] "has_associated_aneurysmTRUE"        "is_surgeryTRUE"                    
[15] "procedure_combinationsER"           "procedure_combinationsERS"         
[17] "procedure_combinationsES"           "procedure_combinationsR"           
[19] "procedure_combinationsRS"           "procedure_combinationsS"           
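The first two columns above (`is_maleFALSE` and `is_maleTRUE`) are a side effect of dropping the intercept: with `- 1` in the formula, `model.matrix()` expands the first factor into one column per level instead of dropping a reference level. A toy illustration (data made up):

```r
# With the intercept removed, the first factor gets one column per level;
# later factors still drop their reference level. Toy data only.
toy <- data.frame(
  y = c(0, 1, 1, 0),
  is_male = factor(c(TRUE, FALSE, TRUE, FALSE)),
  has_hemorrhage = factor(c(TRUE, TRUE, FALSE, FALSE))
)
colnames(model.matrix(y ~ . - 1, toy))
# -> "is_maleFALSE" "is_maleTRUE" "has_hemorrhageTRUE"
```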

Stepwise

Use forward step-wise linear regression (selectiveInference::fs); note that the binary outcome is treated as numeric here.

# Set seed
set.seed(33)

# Run forward step-wise
fsfit <- fs(
  x = X_without_gradings_scaled,
  y = y_without_gradings,
  maxsteps = 2000,
  intercept = TRUE,
  normalize = FALSE
)

# Estimate sigma
sigmahat <- estimateSigma(
  x = X_without_gradings_scaled,
  y = y_without_gradings,
  intercept = TRUE,
  standardize = FALSE
)$sigmahat
# Compute sequential p-values and confidence intervals - for all
# (sigma estimated from full model)
out.seq <- fsInf(fsfit, type = "active", sigma = sigmahat)
out.seq

Call:
fsInf(obj = fsfit, sigma = sigmahat, type = "active")

Standard deviation of noise (specified or estimated) sigma = 0.348

Sequential testing results with alpha = 0.100
 Step Var      Coef Z-score P-value LowConfPt  UpConfPt LowTailArea UpTailArea
    1  14  3.41e-01   7.153   0.052 -5.00e-03  4.07e-01       0.050      0.049
    2   3 -5.90e-02  -5.794   0.000 -1.45e-01 -4.40e-02       0.050      0.050
    3  11 -1.24e-01  -2.546   0.194 -2.03e-01  1.12e-01       0.049      0.049
    4   8 -1.32e-01  -1.683   0.296 -5.53e-01  3.07e-01       0.050      0.050
    5   5 -5.20e-02  -1.101   1.000  4.68e+00       Inf       0.000      0.000
    6   9  5.10e-02   1.029   1.000      -Inf -5.00e+00       0.000        NaN
    7   6 -5.40e-02  -0.986   1.000  5.49e+00       Inf         NaN      0.000
    8  17  6.70e-02   0.936   1.000      -Inf -7.16e+00       0.000        NaN
    9   4  4.20e+13   0.784   1.000      -Inf -5.36e+15       0.000      0.000
   10  18  4.50e-02   0.599   1.000      -Inf -7.57e+00       0.000        NaN
   11  15  2.52e-01   1.514   1.000      -Inf -1.66e+01       0.000        NaN
   12   1  2.00e-03   0.358   1.000      -Inf -6.24e-01       0.000        NaN
   13  16 -3.00e-02  -0.325   1.000  9.12e+00       Inf         NaN      0.000
   14   7  6.00e-03   0.117   1.000      -Inf -5.45e+00       0.000        NaN
   15  12  6.00e-03   0.074   1.000      -Inf -8.27e+00       0.000        NaN
   16  20 -5.00e-03  -0.052   1.000  9.71e+00       Inf         NaN      0.000
   17  19 -5.15e+13  -1.811   1.000  2.84e+15       Inf       0.000      0.000
   18  13 -6.00e-03  -0.099   1.000  6.23e+00       Inf       0.000      0.000
   19  10  4.00e-03   0.062   1.000      -Inf -5.70e+00       0.000      0.000
   20   2  1.00e-03   0.049   1.000      -Inf -2.00e+00       0.000        NaN

Estimated stopping point from ForwardStop rule = 3
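The ForwardStop rule reported above can be sketched in a few lines; this is an illustrative reimplementation (not the package's internal code), run on made-up p-values:

```r
# ForwardStop (G'Sell et al.): stop at the largest step k whose running
# mean of -log(1 - p_i) over the first k sequential p-values is <= alpha.
# This mirrors what fsInf() reports; the p-values below are illustrative.
forward_stop <- function(pvals, alpha = 0.1) {
  stat <- cumsum(-log(1 - pvals)) / seq_along(pvals)
  steps <- which(stat <= alpha)
  if (length(steps) == 0) 0L else max(steps)
}

forward_stop(c(0.001, 0.01, 0.04, 0.60, 0.90))
# -> 3
```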
# Compute p-values and confidence intervals after AIC stopping - for selected
out.aic <- fsInf(fsfit, type = "aic", sigma = sigmahat)
out.aic

Call:
fsInf(obj = fsfit, sigma = sigmahat, type = "aic")

Standard deviation of noise (specified or estimated) sigma = 0.348

Testing results at step = 4, with alpha = 0.100
 Var   Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
  14  0.298    6.18       1      -Inf    -4.82           0          0
   3 -0.046   -3.93       1      1.18      Inf         NaN          0
  11 -0.113   -2.31       1      4.90      Inf         NaN          0
   8 -0.132   -1.68       1      7.85      Inf         NaN          0

Estimated stopping point from AIC rule = 4
# Compute p-values and confidence intervals after 5 fixed steps
out.fix <- fsInf(fsfit, type = "all", k = 5, sigma = sigmahat)
out.fix

Call:
fsInf(obj = fsfit, sigma = sigmahat, k = 5, type = "all")

Standard deviation of noise (specified or estimated) sigma = 0.348

Testing results at step = 5, with alpha = 0.100
 Var   Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
  14  0.298    6.18       1      -Inf    -4.82           0          0
   3 -0.046   -3.93       1      1.18      Inf         NaN          0
  11 -0.113   -2.31       1      4.90      Inf         NaN          0
   8 -0.134   -1.70       1      7.85      Inf           0          0
   5 -0.052   -1.10       1      4.68      Inf           0          0

LASSO - all

Fit a LASSO logistic regression using all available variables.

# Set seed
set.seed(141845)

# Define values
X <- X_all
y <- y_all

# Perform cross-validation
# (alpha = 1 for lasso, alpha = 0 for ridge, 0 < alpha < 1 for elastic net)
cv_fit <- cv.glmnet(
  x = X,
  y = y,
  family = "binomial",
  alpha = 1,
  nfolds = 10
)

# Extract the best lambda
best_lambda_min <- cv_fit$lambda.min
best_lambda_1se <- cv_fit$lambda.1se

# Run glmnet
gfit <- glmnet(
  x = X,
  y = y,
  family = "binomial",
  alpha = 1,
  intercept = TRUE,
  standardize = TRUE
)

# Extract coefficients at the chosen lambda
# (given the small number of observations hence large SD, use min lambda)
# Note: per the selectiveInference docs, fixedLassoInf for
# family = "binomial" expects beta to include the intercept, and its
# lambda is on the summed (not averaged) loss scale; check the manual
# for the exact rescaling relative to glmnet's lambda
lambda <- best_lambda_min
n <- length(y)  # y is a vector, so length(), not nrow()
beta <- coef(
  gfit,
  s = lambda,
  exact = TRUE,
  x = X,
  y = y
)

# Estimate sigma
gsigmahat <- estimateSigma(
  x = X,
  y = y,
  intercept = TRUE,
  standardize = TRUE
)$sigmahat

# Compute fixed lambda p-values and selection intervals
out <- fixedLassoInf(
  x = X,
  y = y,
  beta = beta,
  lambda = lambda,
  # Level of significance
  alpha = 0.1,
  family = "binomial",
  intercept = TRUE,
  sigma = gsigmahat
)

# Prettify
tibble(
  names = names(out$vars),
  odds_ratio = exp(out$coef0),
  pvalue = out$pv,
  ci_lo = exp(out$ci[, 1]),
  ci_hi = exp(out$ci[, 2])
) %>%
  arrange(pvalue) %>%
  sable()
names odds_ratio pvalue ci_lo ci_hi
spetzler_martin_grade 0.672 0.053 0.003 1.02e+00
is_surgeryTRUE 7.625 0.105 0.065 4.22e+16
has_hemorrhageTRUE 1.135 0.118 0.052 6.69e+18
procedure_combinationsERS 0.565 0.184 0.000 5.04e+02
venous_drainageDeep 2.685 0.208 0.367 6.10e+00
is_maleFALSE 1.518 0.340 0.059 1.27e+02
is_diffuse_nidusTRUE 0.718 0.366 0.003 3.12e+01
locationCorpus Callosum 5.626 0.469 0.001 1.78e+02
max_size_cm 0.898 0.543 0.730 1.92e+00
locationDeep 0.577 0.788 0.374 1.05e+04
procedure_combinationsS 3.701 0.791 0.000 2.78e+02
procedure_combinationsES 2.337 0.864 0.000 7.10e+01
has_deficitTRUE 0.417 0.877 0.408 1.10e+10
modified_rankin_score_presentation 1.076 0.960 0.000 8.57e-01

LASSO - without scores

Fit a LASSO logistic regression that excludes the composite grading scores (components only).

# Set seed
set.seed(141845)

# Define values
X <- X_without_gradings
y <- y_without_gradings

# Perform cross-validation
# (alpha = 1 for lasso, alpha = 0 for ridge, 0 < alpha < 1 for elastic net)
cv_fit <- cv.glmnet(
  x = X,
  y = y,
  family = "binomial",
  alpha = 1,
  nfolds = 10
)

# Extract the best lambda
best_lambda_min <- cv_fit$lambda.min
best_lambda_1se <- cv_fit$lambda.1se

# Run glmnet
gfit <- glmnet(
  x = X,
  y = y,
  family = "binomial",
  alpha = 1,
  intercept = TRUE,
  standardize = TRUE
)

# Extract coefficients at the chosen lambda
# (given the small number of observations hence large SD, use min lambda)
# Note: per the selectiveInference docs, fixedLassoInf for
# family = "binomial" expects beta to include the intercept, and its
# lambda is on the summed (not averaged) loss scale; check the manual
# for the exact rescaling relative to glmnet's lambda
lambda <- best_lambda_min
n <- length(y)  # y is a vector, so length(), not nrow()
beta <- coef(
  gfit,
  s = lambda,
  exact = TRUE,
  x = X,
  y = y
)

# Estimate sigma
gsigmahat <- estimateSigma(
  x = X,
  y = y,
  intercept = TRUE,
  standardize = TRUE
)$sigmahat

# Compute fixed lambda p-values and selection intervals
out <- fixedLassoInf(
  x = X,
  y = y,
  beta = beta,
  lambda = lambda,
  # Level of significance
  alpha = 0.1,
  family = "binomial",
  intercept = TRUE,
  sigma = gsigmahat
)

# Prettify
tibble(
  names = names(out$vars),
  odds_ratio = exp(out$coef0),
  pvalue = out$pv,
  ci_lo = exp(out$ci[, 1]),
  ci_hi = exp(out$ci[, 2])
) %>%
  arrange(pvalue) %>%
  sable()
names odds_ratio pvalue ci_lo ci_hi
max_size_cm 0.801 0.019 0.657 9.52e-01
has_deficitTRUE 0.456 0.085 0.235 1.20e+00
is_surgeryTRUE 8.176 0.111 0.117 2.29e+10
is_eloquent_locationTRUE 0.471 0.202 0.202 2.16e+00
procedure_combinationsERS 0.474 0.213 0.000 1.73e+02
is_diffuse_nidusTRUE 0.623 0.439 0.263 6.64e+00
has_hemorrhageTRUE 1.473 0.514 0.135 2.69e+00
procedure_combinationsS 3.739 0.705 0.000 2.32e+02
is_maleFALSE 1.378 0.723 0.012 2.34e+00
procedure_combinationsES 2.069 0.819 0.000 6.87e+01

LASSO - without components

Fit a LASSO logistic regression that uses the composite grading scores rather than their components.

# Set seed
set.seed(141845)

# Define values
X <- X_with_gradings
y <- y_with_gradings

# Perform cross-validation
# (alpha = 1 for lasso, alpha = 0 for ridge, 0 < alpha < 1 for elastic net)
cv_fit <- cv.glmnet(
  x = X,
  y = y,
  family = "binomial",
  alpha = 1,
  nfolds = 10
)

# Extract the best lambda
best_lambda_min <- cv_fit$lambda.min
best_lambda_1se <- cv_fit$lambda.1se

# Run glmnet
gfit <- glmnet(
  x = X,
  y = y,
  family = "binomial",
  alpha = 1,
  intercept = TRUE,
  standardize = TRUE
)

# Extract coefficients at the chosen lambda
# (given the small number of observations hence large SD, use min lambda)
# Note: per the selectiveInference docs, fixedLassoInf for
# family = "binomial" expects beta to include the intercept, and its
# lambda is on the summed (not averaged) loss scale; check the manual
# for the exact rescaling relative to glmnet's lambda
lambda <- best_lambda_min
n <- length(y)  # y is a vector, so length(), not nrow()
beta <- coef(
  gfit,
  s = lambda,
  exact = TRUE,
  x = X,
  y = y
)

# Estimate sigma
gsigmahat <- estimateSigma(
  x = X,
  y = y,
  intercept = TRUE,
  standardize = TRUE
)$sigmahat

# Compute fixed lambda p-values and selection intervals
out <- fixedLassoInf(
  x = X,
  y = y,
  beta = beta,
  lambda = lambda,
  # Level of significance
  alpha = 0.1,
  family = "binomial",
  intercept = TRUE,
  sigma = gsigmahat
)

# Prettify
tibble(
  names = names(out$vars),
  odds_ratio = exp(out$coef0),
  pvalue = out$pv,
  ci_lo = exp(out$ci[, 1]),
  ci_hi = exp(out$ci[, 2])
) %>%
  arrange(pvalue) %>%
  sable()
names odds_ratio pvalue ci_lo ci_hi
spetzler_martin_grade 0.571 0.013 0.300 8.50e-01
is_surgeryTRUE 11.645 0.072 0.645 1.66e+03
has_deficitTRUE 0.437 0.125 0.233 1.50e+00
is_diffuse_nidusTRUE 0.610 0.246 0.011 4.54e+00
procedure_combinationsERS 0.349 0.334 0.003 3.37e+01
locationCorpus Callosum 7.891 0.400 0.003 5.15e+01
has_hemorrhageTRUE 1.449 0.442 0.183 4.76e+00
is_maleFALSE 1.385 0.551 0.005 4.91e+01
procedure_combinationsS 2.080 0.763 0.000 1.83e+01
has_seizuresTRUE 0.750 0.948 0.969 7.38e+13

Multivariable - Without scores

Setup

Feature selection

Create predictors.

# Define p-value threshold
pval_threshold <- 0.2

# Get the predictors of interest
predictors <-
  univariable_adjusted_recoded %>%
  bind_rows(univariable_unadjusted_recoded) %>%
  filter(`P-values` < pval_threshold) %>%
  pull(`Predictors`)

# Clean categorical variables
predictors <-
  predictors %>%
  str_replace("^([a-z0-9_]*)([A-Z].*)$", "\\1") %>%
  unique()

# Deduplicate (predictors is already unique; kept as a safeguard)
predictors <- unique(predictors)

# Remove unwanted predictors
predictors <- predictors[!predictors %in% unwanted_without_gradings]

# Print
print(predictors)
[1] "max_size_cm"              "is_eloquent_location"    
[3] "is_diffuse_nidus"         "has_hemorrhage"          
[5] "has_deep_venous_drainage" "has_deficit"             
[7] "procedure_combinations"   "has_associated_aneurysm" 
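The `str_replace()` call above strips the factor-level suffix that model output appends to categorical terms; names without an uppercase suffix pass through unchanged. A standalone illustration (inputs made up):

```r
library(stringr)

# Strip trailing factor-level suffixes (the capitalized part) from
# model-output term names; purely lowercase names are left unchanged.
str_replace(
  c("is_eloquent_locationTRUE", "procedure_combinationsMultimodal", "max_size_cm"),
  "^([a-z0-9_]*)([A-Z].*)$",
  "\\1"
)
# -> "is_eloquent_location" "procedure_combinations" "max_size_cm"
```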

Model selection

Fit logistic.

# Define data
df <- df_multi %>% drop_na(modified_rankin_score_pretreatment, location)

# Create formula
model <- as.formula(paste(
  CHOSEN_OUTCOME,
  "~",
  paste(predictors, collapse = " + ")
))

fit <- glm(
  model,
  data = df,
  family = binomial()
)

# Save
fit_without_grading <- fit

Print results.

# Summarize model coefficients
fit_summary <-
  fit %>%
  broom::tidy(exponentiate = T, conf.int = T) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
stylized <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high,
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3),
  )

# Print
stylized %>% sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
max_size_cm 7.90e-01 0.10 -2.496 0.013 0.65 9.50e-01
procedure_combinationsMultimodal 1.28e+01 1.11 2.306 0.021 1.50 1.36e+02
is_eloquent_locationTRUE 3.50e-01 0.56 -1.893 0.058 0.10 9.70e-01
procedure_combinationsR 4.75e+00 1.15 1.353 0.176 0.51 5.44e+01
has_hemorrhageTRUE 1.72e+00 0.42 1.309 0.191 0.77 3.93e+00
has_deficitTRUE 6.40e-01 0.40 -1.105 0.269 0.29 1.41e+00
is_diffuse_nidusTRUE 6.10e-01 0.54 -0.905 0.365 0.21 1.81e+00
has_deep_venous_drainageTRUE 7.60e-01 0.44 -0.615 0.538 0.32 1.79e+00
has_associated_aneurysmTRUE 8.20e-01 0.51 -0.393 0.695 0.31 2.28e+00
procedure_combinationsS 3.08e+08 1426.07 0.014 0.989 0.00 1.58e+226
(Intercept) 2.30e+00 1.19 0.701 0.484 0.20 2.56e+01

Inference

Create robust CIs using the sandwich estimator.

# Calculate robust standard errors
robust_se <- sandwich::vcovHC(fit, type = "HC0")

# Compute robust confidence intervals
robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)

Stylize results.

# Summarize model coefficients
fit_summary <-
  robust_ci %>%
  # broom does not support exponentiation after coeftest so do it manually
  broom::tidy(exponentiate = F, conf.int = T) %>%
  mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
multivariable_pvalue <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high,
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3),
  )

# Print
multivariable_pvalue %>% sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
procedure_combinationsS 3.08e+08 0.80 24.583 0.000 6.49e+07 1.47e+09
procedure_combinationsMultimodal 1.28e+01 0.82 3.103 0.002 2.56e+00 6.43e+01
max_size_cm 7.90e-01 0.10 -2.388 0.017 6.50e-01 9.60e-01
is_eloquent_locationTRUE 3.50e-01 0.56 -1.894 0.058 1.20e-01 1.04e+00
procedure_combinationsR 4.75e+00 0.88 1.777 0.076 8.50e-01 2.65e+01
has_hemorrhageTRUE 1.72e+00 0.42 1.282 0.200 7.50e-01 3.95e+00
has_deficitTRUE 6.40e-01 0.38 -1.147 0.251 3.00e-01 1.37e+00
is_diffuse_nidusTRUE 6.10e-01 0.56 -0.874 0.382 2.00e-01 1.84e+00
has_deep_venous_drainageTRUE 7.60e-01 0.47 -0.572 0.567 3.00e-01 1.92e+00
has_associated_aneurysmTRUE 8.20e-01 0.49 -0.403 0.687 3.10e-01 2.16e+00
(Intercept) 2.30e+00 1.00 0.835 0.404 3.30e-01 1.63e+01

Plot the coefficients with their confidence intervals.

robust_ci %>%
  # broom does not support exponentiation after coeftest so do it manually
  broom::tidy(exponentiate = F, conf.int = T) %>%
  mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
  filter(term != "(Intercept)") %>%
  arrange(desc(estimate)) %>%
  mutate(term = factor(term, levels = term)) %>%
  ggplot(aes(x = estimate, y = term, color = term)) +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 1, linetype = 2, alpha = 0.5) +
  labs(x = "Odds ratio", y = NULL) +
  guides(color = F) +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 13)
  )

# Dotwhisker was not giving me the correct plot
# dotwhisker::dwplot(
#   vline = geom_vline(xintercept = 1, linetype = 2, alpha =  0.5),
#   show_intercept = F
# )

Predict

Predict.

# Predicted probabilities
predicted_probs <- predict(fit, type = "response")

Plot histogram of predictions.

tibble(
  Predictions = predicted_probs,
  Truth = fit$model[, CHOSEN_OUTCOME]
) %>%
  pivot_longer(
    cols = everything(),
    names_to = "keys",
    values_to = "values"
  ) %>%
  mutate(keys = fct_inorder(keys)) %>%
  ggplot(aes(x = values, color = keys, fill = keys)) +
  geom_histogram(alpha = 0.7) +
  geom_vline(xintercept = 0.5, linetype = 2, alpha = 0.5) +
  facet_wrap(vars(keys), ncol = 1, scales = "fixed") +
  theme(
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text = element_text(size = 12),
    strip.text.x = element_text(size = 13),
    legend.position = "none"
  )

Evaluate

How many examples were used?

# All examples
num_all_examples <- nrow(fit$data)
num_all_examples_positive <- sum(fit$data[, CHOSEN_OUTCOME], na.rm = T)
num_complete_cases <- nrow(fit$model)
num_complete_cases_positive <- sum(fit$model[, CHOSEN_OUTCOME], na.rm = T)

# Print
sprintf(
  "
  All examples = %s (%s with outcome)
  Fitted examples = %s (%s with outcome)",
  num_all_examples,
  num_all_examples_positive,
  num_complete_cases,
  num_complete_cases_positive
) %>% cat()

  All examples = 232 (182 with outcome)
  Fitted examples = 223 (175 with outcome)

Residual analysis.

# Residuals
residuals <- residuals(fit, type = "deviance")

# Plot residuals
plot(residuals)

Pseudo R-squared.

# Calculate Pseudo R-squared values
pseudo_r2 <- pscl::pR2(fit)
fitting null model for pseudo-r2
print(pseudo_r2)
     llh  llhNull       G2 McFadden     r2ML     r2CU 
 -81.926 -116.144   68.436    0.295    0.264    0.408 
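As a quick arithmetic check, McFadden's pseudo-R² and G² can be recomputed directly from the two log-likelihoods printed above:

```r
# Recompute McFadden's pseudo-R^2 and G^2 from the printed log-likelihoods
llh <- -81.926        # log-likelihood of the fitted model
llh_null <- -116.144  # log-likelihood of the intercept-only model

round(1 - llh / llh_null, 3)    # McFadden = 0.295, matching pR2()
round(2 * (llh - llh_null), 3)  # G2 = 68.436, matching pR2()
```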

ROC-AUC.

# Compute
auroc <- cvAUC::ci.cvAUC(
  predictions = predicted_probs,
  labels = fit$model[, CHOSEN_OUTCOME]
)

# Print
sprintf(
  "Cross-validated AUROC = %.3f (SE, %.3f; %s%% CI, %.3f-%.3f)",
  auroc$cvAUC,
  auroc$se,
  auroc$confidence * 100,
  auroc$ci[1],
  auroc$ci[2]
)
[1] "Cross-validated AUROC = 0.847 (SE, 0.028; 95% CI, 0.792-0.902)"

PR-AUC.

# Construct object
precrec_obj_tr <- precrec::evalmod(
  scores = predicted_probs,
  labels = fit$model[, CHOSEN_OUTCOME],
  calc_avg = F,
  cb_alpha = 0.05
)

# Print
# NOTE: Can use precrec::auc_ci to get CIs, but need to have multiple datasets
precrec_obj_tr %>%
  precrec::part() %>%
  precrec::pauc() %>%
  sable()
modnames dsids curvetypes paucs spaucs
m1 1 ROC 0.847 0.847
m1 1 PRC 0.954 0.954

ROC and PR curves.

# Plot
precrec_obj_tr %>% autoplot()

Another version of the ROC curve.

# Calculate ROC curve
roc_curve <- pROC::roc(fit$model[, CHOSEN_OUTCOME] ~ predicted_probs)

# Plot ROC curve
plot(roc_curve)

# Calculate AUC
auc <- pROC::auc(roc_curve)
print(auc)
Area under the curve: 0.847

Plot all performance measures.

# Construct object
precrec_obj_tr2 <- precrec::evalmod(
  scores = predicted_probs,
  labels = fit$model[, CHOSEN_OUTCOME],
  mode = "basic"
)

# Plot
precrec_obj_tr2 %>% autoplot()

Confusion matrix.

Kappa statistic: 0 = no agreement; 0.1-0.2 = slight agreement; 0.2-0.4 = fair agreement; 0.4-0.6 = moderate agreement; 0.6-0.8 = substantial agreement; 0.8-1 = near-perfect agreement

No Information Rate (NIR): the proportion of the largest observed class. For example, if 23 of 35 patients had the outcome, the outcome group is the largest class and the NIR is 23/35 = 0.657. The further the accuracy rises above the NIR, the more confident we can be that the model is doing more than always predicting the largest class.
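The `P-Value [Acc > NIR]` that caret reports is, to my understanding, a one-sided binomial test of the observed accuracy against the NIR. A minimal sketch using the hypothetical 35-patient example above (the number of correct predictions is invented for illustration):

```r
n_total <- 35
n_largest <- 23
nir <- n_largest / n_total  # 0.657, the no-information rate

# Hypothetical: suppose the model classified 28 of 35 patients correctly
n_correct <- 28
binom.test(n_correct, n_total, p = nir, alternative = "greater")$p.value
```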

# Scores
pred <- factor(ifelse(predicted_probs > 0.5, "Event", "No Event"))
truth <- factor(ifelse(fit$model[, CHOSEN_OUTCOME], "Event", "No Event"))

# Count values per category
xtab <- table(
  scores = pred,
  labels = truth
)

# Compute
confusion_matrix <-
  caret::confusionMatrix(
    xtab,
    positive = "Event",
    prevalence = NULL, # Provide prevalence as a proportion here if you want
    mode = "sens_spec"
  )

# Print
confusion_matrix
Confusion Matrix and Statistics

          labels
scores     Event No Event
  Event      165       27
  No Event    10       21
                                       
               Accuracy : 0.834        
                 95% CI : (0.779, 0.88)
    No Information Rate : 0.785        
    P-Value [Acc > NIR] : 0.04050      
                                       
                  Kappa : 0.436        
                                       
 Mcnemar's Test P-Value : 0.00853      
                                       
            Sensitivity : 0.943        
            Specificity : 0.438        
         Pos Pred Value : 0.859        
         Neg Pred Value : 0.677        
             Prevalence : 0.785        
         Detection Rate : 0.740        
   Detection Prevalence : 0.861        
      Balanced Accuracy : 0.690        
                                       
       'Positive' Class : Event        
                                       

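The headline statistics can be reproduced by hand from the 2x2 table above; a quick check of the arithmetic:

```r
# Counts from the confusion matrix above (Event = positive class)
tp <- 165  # predicted Event, truly Event
fn <- 10   # predicted No Event, truly Event
fp <- 27   # predicted Event, truly No Event
tn <- 21   # predicted No Event, truly No Event
n <- tp + fn + fp + tn  # 223 fitted examples

round(c(
  Sensitivity = tp / (tp + fn),  # 0.943
  Specificity = tn / (tn + fp),  # 0.438
  Accuracy    = (tp + tn) / n,   # 0.834
  NIR         = (tp + fn) / n    # 0.785 (Event is the largest class)
), 3)
```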
Variable importance

Plot importance (based on magnitude of z-score).

# Calculate importance
varimp <-
  fit %>%
  caret::varImp() %>%
  as_tibble(rownames = "rn") %>%
  mutate(Predictor = factor(rn) %>% fct_reorder(Overall)) %>%
  dplyr::select(Predictor, Importance = Overall)

# Plot variable importance
varimp %>%
  ggplot(aes(Predictor, Importance, fill = Importance)) +
  geom_bar(stat = "identity", alpha = 0.9, width = 0.65) +
  coord_flip() +
  guides(fill = F) +
  labs(y = "\nVariable Importance", x = NULL) +
  scale_fill_viridis_c() +
  theme(
    panel.grid.major.y = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.border = element_blank(),
    axis.text.x = element_text(margin = margin(t = 7), size = 12),
    axis.text.y = element_text(margin = margin(r = -10), size = 12),
    axis.title = element_text(size = 13)
  )
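As a sanity check on the heading's claim, for a glm caret's `varImp()` should simply be the absolute z-statistic with the intercept removed. A toy illustration on `mtcars` (illustrative data only, assuming that is indeed how `varImp()` is defined for glm objects):

```r
# Toy glm: importance should equal |z| = |Estimate / SE|, intercept excluded
fit_toy <- glm(vs ~ mpg + wt, data = mtcars, family = binomial())

z_abs <- abs(summary(fit_toy)$coefficients[-1, "z value"])
imp <- caret::varImp(fit_toy)$Overall

# The two vectors should match if the assumption holds
cbind(z_abs, imp)
```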


Multivariable - Without components

Feature selection

Create predictors.

# Define p-value threshold
pval_threshold <- 0.2

# Get the predictors of interest
predictors <-
  univariable_unadjusted_recoded %>%
  bind_rows(univariable_adjusted) %>%
  filter(`P-values` < pval_threshold) %>%
  pull(`Predictors`)

# Clean categorical variables
predictors <-
  predictors %>%
  str_replace("^([a-z0-9_]*)([A-Z].*)$", "\\1") %>%
  unique()

# Deduplicate (no additional adjustment variables are added here)
predictors <- unique(c(predictors))

# Remove unwanted predictors
predictors <- predictors[!predictors %in% unwanted_with_gradings]

# Print
print(predictors)
[1] "spetzler_martin_grade"   "is_diffuse_nidus"       
[3] "has_hemorrhage"          "location"               
[5] "has_deficit"             "procedure_combinations" 
[7] "is_surgery"              "has_associated_aneurysm"
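To make the term-cleaning regex above concrete: it keeps the snake_case variable name and drops the factor-level suffix that broom appends (the level starts with an uppercase character), leaving continuous terms untouched:

```r
library(stringr)

terms <- c(
  "has_hemorrhageTRUE",               # logical level appended
  "procedure_combinationsMultimodal", # factor level appended
  "max_size_cm"                       # continuous: no uppercase, unchanged
)
str_replace(terms, "^([a-z0-9_]*)([A-Z].*)$", "\\1")
#> "has_hemorrhage" "procedure_combinations" "max_size_cm"
```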

Model selection

Fit logistic.

# Define data
df <- df_multi

# Create formula
model <- as.formula(paste(
  CHOSEN_OUTCOME,
  "~",
  paste(predictors, collapse = " + ")
))

fit <- glm(
  model,
  data = df,
  family = binomial()
)

fit_with_grading <- fit

Print results.

# Summarize model coefficients
fit_summary <-
  fit %>%
  broom::tidy(exponentiate = T, conf.int = T) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
stylized <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high,
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3),
  )

# Print
stylized %>% sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
spetzler_martin_grade 5.00e-01 0.24 -2.894 0.004 0.31 7.90e-01
procedure_combinationsMultimodal 1.99e+01 1.34 2.233 0.026 1.68 3.65e+02
procedure_combinationsR 7.88e+00 1.37 1.512 0.131 0.62 1.49e+02
locationDeep 2.70e-01 0.88 -1.502 0.133 0.03 1.25e+00
has_hemorrhageTRUE 1.80e+00 0.41 1.451 0.147 0.82 4.04e+00
locationHemispheric 2.90e-01 0.86 -1.427 0.154 0.04 1.31e+00
has_deficitTRUE 6.20e-01 0.41 -1.178 0.239 0.27 1.39e+00
is_diffuse_nidusTRUE 6.20e-01 0.54 -0.876 0.381 0.22 1.82e+00
has_associated_aneurysmTRUE 7.40e-01 0.50 -0.587 0.557 0.28 2.04e+00
procedure_combinationsS 4.29e+08 1410.71 0.014 0.989 0.00 2.89e+218
is_surgeryTRUE
(Intercept) 6.62e+00 1.48 1.276 0.202 0.41 1.61e+02

Note: is_surgeryTRUE has no estimates because the term is aliased (surgery status is presumably fully determined by procedure_combinations), so glm() returned NA for its coefficient.

Inference

Create robust CIs using the sandwich estimator.

# Calculate robust standard errors
robust_se <- sandwich::vcovHC(fit, type = "HC0")

# Compute robust confidence intervals
robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)

Stylize results.

# Summarize model coefficients
fit_summary <-
  robust_ci %>%
  # broom does not support exponentiation after coeftest so do it manually
  broom::tidy(exponentiate = F, conf.int = T) %>%
  mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
multivariable_pvalue <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high,
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3),
  )

# Print
multivariable_pvalue %>% sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
procedure_combinationsS 4.29e+08 0.81 24.547 0.000 8.78e+07 2.10e+09
procedure_combinationsMultimodal 1.99e+01 0.84 3.565 0.000 3.84e+00 1.03e+02
spetzler_martin_grade 5.00e-01 0.23 -3.007 0.003 3.20e-01 7.90e-01
procedure_combinationsR 7.88e+00 0.91 2.280 0.023 1.34e+00 4.65e+01
locationDeep 2.70e-01 0.86 -1.535 0.125 5.00e-02 1.44e+00
has_hemorrhageTRUE 1.80e+00 0.40 1.461 0.144 8.20e-01 3.97e+00
locationHemispheric 2.90e-01 0.85 -1.460 0.144 6.00e-02 1.53e+00
has_deficitTRUE 6.20e-01 0.41 -1.180 0.238 2.70e-01 1.38e+00
is_diffuse_nidusTRUE 6.20e-01 0.53 -0.888 0.375 2.20e-01 1.77e+00
has_associated_aneurysmTRUE 7.40e-01 0.50 -0.587 0.557 2.80e-01 2.00e+00
(Intercept) 6.62e+00 1.07 1.775 0.076 8.20e-01 5.34e+01

Plot the coefficients with their confidence intervals.

robust_ci %>%
  # broom does not support exponentiation after coeftest so do it manually
  broom::tidy(exponentiate = F, conf.int = T) %>%
  mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
  filter(term != "(Intercept)") %>%
  arrange(desc(estimate)) %>%
  mutate(term = factor(term, levels = term)) %>%
  ggplot(aes(x = estimate, y = term, color = term)) +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 1, linetype = 2, alpha = 0.5) +
  labs(x = "Odds ratio", y = NULL) +
  guides(color = F) +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 13)
  )

# Dotwhisker was not giving me the correct plot
# dotwhisker::dwplot(
#   vline = geom_vline(xintercept = 1, linetype = 2, alpha =  0.5),
#   show_intercept = F
# )

Predict

Predict.

# Predicted probabilities
predicted_probs <- predict(fit, type = "response")

Plot histogram of predictions.

tibble(
  Predictions = predicted_probs,
  Truth = fit$model[, CHOSEN_OUTCOME]
) %>%
  pivot_longer(
    cols = everything(),
    names_to = "keys",
    values_to = "values"
  ) %>%
  mutate(keys = fct_inorder(keys)) %>%
  ggplot(aes(x = values, color = keys, fill = keys)) +
  geom_histogram(alpha = 0.7) +
  geom_vline(xintercept = 0.5, linetype = 2, alpha = 0.5) +
  facet_wrap(vars(keys), ncol = 1, scales = "fixed") +
  theme(
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text = element_text(size = 12),
    strip.text.x = element_text(size = 13),
    legend.position = "none"
  )

Evaluate

How many examples were used?

# All examples
num_all_examples <- nrow(fit$data)
num_all_examples_positive <- sum(fit$data[, CHOSEN_OUTCOME], na.rm = T)
num_complete_cases <- nrow(fit$model)
num_complete_cases_positive <- sum(fit$model[, CHOSEN_OUTCOME], na.rm = T)

# Print
sprintf(
  "
  All examples = %s (%s with outcome)
  Fitted examples = %s (%s with outcome)",
  num_all_examples,
  num_all_examples_positive,
  num_complete_cases,
  num_complete_cases_positive
) %>% cat()

  All examples = 236 (184 with outcome)
  Fitted examples = 223 (175 with outcome)

Residual analysis.

# Residuals
residuals <- residuals(fit, type = "deviance")

# Plot residuals
plot(residuals)

Pseudo R-squared.

# Calculate Pseudo R-squared values
pseudo_r2 <- pscl::pR2(fit)
fitting null model for pseudo-r2
print(pseudo_r2)
     llh  llhNull       G2 McFadden     r2ML     r2CU 
 -80.851 -116.144   70.587    0.304    0.271    0.419 

ROC-AUC.

# Compute
auroc <- cvAUC::ci.cvAUC(
  predictions = predicted_probs,
  labels = fit$model[, CHOSEN_OUTCOME]
)

# Print
sprintf(
  "Cross-validated AUROC = %.3f (SE, %.3f; %s%% CI, %.3f-%.3f)",
  auroc$cvAUC,
  auroc$se,
  auroc$confidence * 100,
  auroc$ci[1],
  auroc$ci[2]
)
[1] "Cross-validated AUROC = 0.847 (SE, 0.029; 95% CI, 0.791-0.904)"

PR-AUC.

# Construct object
precrec_obj_tr <- precrec::evalmod(
  scores = predicted_probs,
  labels = fit$model[, CHOSEN_OUTCOME],
  calc_avg = F,
  cb_alpha = 0.05
)

# Print
# NOTE: Can use precrec::auc_ci to get CIs, but need to have multiple datasets
precrec_obj_tr %>%
  precrec::part() %>%
  precrec::pauc() %>%
  sable()
modnames dsids curvetypes paucs spaucs
m1 1 ROC 0.847 0.847
m1 1 PRC 0.954 0.954

ROC and PR curves.

# Plot
precrec_obj_tr %>% autoplot()

Another version of the ROC curve.

# Calculate ROC curve
roc_curve <- pROC::roc(fit$model[, CHOSEN_OUTCOME] ~ predicted_probs)

# Plot ROC curve
plot(roc_curve)

# Calculate AUC
auc <- pROC::auc(roc_curve)
print(auc)
Area under the curve: 0.847

Plot all performance measures.

# Construct object
precrec_obj_tr2 <- precrec::evalmod(
  scores = predicted_probs,
  labels = fit$model[, CHOSEN_OUTCOME],
  mode = "basic"
)

# Plot
precrec_obj_tr2 %>% autoplot()

Confusion matrix.

Kappa statistic: 0 = no agreement; 0.1-0.2 = slight agreement; 0.2-0.4 = fair agreement; 0.4-0.6 = moderate agreement; 0.6-0.8 = substantial agreement; 0.8-1 = near-perfect agreement

No Information Rate (NIR): the proportion of the largest observed class. For example, if 23 of 35 patients had the outcome, the outcome group is the largest class and the NIR is 23/35 = 0.657. The further the accuracy rises above the NIR, the more confident we can be that the model is doing more than always predicting the largest class.

# Scores
pred <- factor(ifelse(predicted_probs > 0.5, "Event", "No Event"))
truth <- factor(ifelse(fit$model[, CHOSEN_OUTCOME], "Event", "No Event"))

# Count values per category
xtab <- table(
  scores = pred,
  labels = truth
)

# Compute
confusion_matrix <-
  caret::confusionMatrix(
    xtab,
    positive = "Event",
    prevalence = NULL, # Provide prevalence as a proportion here if you want
    mode = "sens_spec"
  )

# Print
confusion_matrix
Confusion Matrix and Statistics

          labels
scores     Event No Event
  Event      165       25
  No Event    10       23
                                        
               Accuracy : 0.843         
                 95% CI : (0.789, 0.888)
    No Information Rate : 0.785         
    P-Value [Acc > NIR] : 0.0182        
                                        
                  Kappa : 0.476         
                                        
 Mcnemar's Test P-Value : 0.0180        
                                        
            Sensitivity : 0.943         
            Specificity : 0.479         
         Pos Pred Value : 0.868         
         Neg Pred Value : 0.697         
             Prevalence : 0.785         
         Detection Rate : 0.740         
   Detection Prevalence : 0.852         
      Balanced Accuracy : 0.711         
                                        
       'Positive' Class : Event         
                                        

Variable importance

Plot importance (based on magnitude of z-score).

# Calculate importance
varimp <-
  fit %>%
  caret::varImp() %>%
  as_tibble(rownames = "rn") %>%
  mutate(Predictor = factor(rn) %>% fct_reorder(Overall)) %>%
  dplyr::select(Predictor, Importance = Overall)

# Plot variable importance
varimp %>%
  ggplot(aes(Predictor, Importance, fill = Importance)) +
  geom_bar(stat = "identity", alpha = 0.9, width = 0.65) +
  coord_flip() +
  guides(fill = F) +
  labs(y = "\nVariable Importance", x = NULL) +
  scale_fill_viridis_c() +
  theme(
    panel.grid.major.y = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.border = element_blank(),
    axis.text.x = element_text(margin = margin(t = 7), size = 12),
    axis.text.y = element_text(margin = margin(r = -10), size = 12),
    axis.title = element_text(size = 13)
  )


Multivariable - With interactions

Presents with hemorrhage

Fit logistic.

# Define data
df <- df_multi

# Create formula
model <- as.formula(paste(
  CHOSEN_OUTCOME,
  "~",
  "modified_rankin_score_pretreatment*has_hemorrhage +
  spetzler_martin_grade*has_hemorrhage +
  age_at_first_treatment_yrs*has_hemorrhage"
))

fit <- glm(
  model,
  data = df,
  family = binomial()
)

Print results.

# Summarize model coefficients
fit_summary <-
  fit %>%
  broom::tidy(exponentiate = T, conf.int = T) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
stylized <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high,
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3),
  )

# Print
stylized %>% sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
spetzler_martin_grade 0.38 0.25 -3.778 0.000 0.22 0.61
modified_rankin_score_pretreatment:has_hemorrhageTRUE 0.55 0.36 -1.650 0.099 0.26 1.10
modified_rankin_score_pretreatment 1.57 0.32 1.417 0.157 0.86 3.05
has_hemorrhageTRUE 6.59 1.73 1.093 0.274 0.22 204.87
age_at_first_treatment_yrs 1.05 0.06 0.848 0.397 0.93 1.19
has_hemorrhageTRUE:age_at_first_treatment_yrs 0.95 0.10 -0.557 0.577 0.78 1.15
has_hemorrhageTRUE:spetzler_martin_grade 1.08 0.36 0.220 0.826 0.54 2.20
(Intercept) 18.21 1.17 2.482 0.013 2.20 224.85

Model comparison

Likelihood ratio

Based on the likelihood ratio test, there is no statistical evidence that the grading-based model fits the data better than the grading-independent model. Note that the two fits have the same residual degrees of freedom (Df = 0 below), so anova() cannot attach a p-value to the deviance difference of 2.15.

anova(fit_without_grading, fit_with_grading, test = "LR")
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
212 164
212 162 0 2.15
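For reference, when a nested comparison does free up degrees of freedom, the likelihood-ratio p-value is the chi-squared tail probability of the deviance drop. A hedged sketch (the df here is hypothetical; the actual comparison above has Df = 0, which is why no p-value is printed):

```r
dev_diff <- 2.15  # deviance difference from the table above
df_diff <- 1      # hypothetical degrees of freedom

pchisq(dev_diff, df = df_diff, lower.tail = FALSE)  # ~0.143
```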

Special cases

SM < 4

Select features.

# Define p-value threshold
pval_threshold <- 0.2

# Get the predictors of interest
predictors <-
  univariable_adjusted_recoded %>%
  bind_rows(univariable_unadjusted_recoded) %>%
  filter(`P-values` < pval_threshold) %>%
  pull(`Predictors`)

# Clean categorical variables
predictors <-
  predictors %>%
  str_replace("^([a-z0-9_]*)([A-Z].*)$", "\\1") %>%
  unique()

# Deduplicate (no additional adjustment variables are added here)
predictors <- unique(c(predictors))

# Remove unwanted predictors
predictors <- predictors[!predictors %in% unwanted_with_gradings]

# Replace SM by its binary variant
predictors <- c(predictors, "is_spetzler_martin_grade_less_than_4")
predictors <- predictors[!predictors == "spetzler_martin_grade"]

# Print
print(predictors)
[1] "is_diffuse_nidus"                    
[2] "has_hemorrhage"                      
[3] "location"                            
[4] "has_deficit"                         
[5] "procedure_combinations"              
[6] "has_associated_aneurysm"             
[7] "is_spetzler_martin_grade_less_than_4"

Fit logistic.

# Define data
df <- df_multi %>% drop_na(modified_rankin_score_pretreatment, location)

# Create formula
model <- as.formula(paste(
  CHOSEN_OUTCOME,
  "~",
  paste(predictors, collapse = " + ")
))

fit <- glm(
  model,
  data = df,
  family = binomial()
)

# Save
fit_without_grading <- fit

Print results.

# Summarize model coefficients
fit_summary <-
  fit %>%
  broom::tidy(exponentiate = T, conf.int = T) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
stylized <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high,
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3),
  )

# Print
stylized %>% sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
procedure_combinationsMultimodal 1.42e+01 1.19 2.235 0.025 1.53 1.91e+02
locationDeep 1.90e-01 0.88 -1.913 0.056 0.02 8.50e-01
is_diffuse_nidusTRUE 4.20e-01 0.52 -1.662 0.097 0.15 1.17e+00
is_spetzler_martin_grade_less_than_4TRUE 2.07e+00 0.45 1.618 0.106 0.85 5.03e+00
procedure_combinationsR 6.37e+00 1.22 1.515 0.130 0.64 9.03e+01
locationHemispheric 2.70e-01 0.86 -1.504 0.133 0.03 1.20e+00
has_hemorrhageTRUE 1.82e+00 0.40 1.488 0.137 0.83 4.04e+00
has_deficitTRUE 6.50e-01 0.40 -1.072 0.284 0.30 1.44e+00
has_associated_aneurysmTRUE 7.10e-01 0.49 -0.685 0.493 0.28 1.91e+00
procedure_combinationsS 1.95e+08 878.36 0.022 0.983 0.00 2.17e+138
(Intercept) 7.30e-01 1.29 -0.247 0.805 0.06 1.07e+01

Create robust CIs using the sandwich estimator.

# Calculate robust standard errors
robust_se <- sandwich::vcovHC(fit, type = "HC0")

# Compute robust confidence intervals
robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)

Stylize results.

# Summarize model coefficients
fit_summary <-
  robust_ci %>%
  # broom does not support exponentiation after coeftest so do it manually
  broom::tidy(exponentiate = F, conf.int = T) %>%
  mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
multivariable_pvalue <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high,
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3),
  )

# Print
multivariable_pvalue %>% sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
procedure_combinationsS 1.95e+08 0.84 22.756 0.000 3.77e+07 1.01e+09
procedure_combinationsMultimodal 1.42e+01 0.83 3.191 0.001 2.78e+00 7.24e+01
procedure_combinationsR 6.37e+00 0.89 2.076 0.038 1.11e+00 3.66e+01
locationDeep 1.90e-01 0.93 -1.804 0.071 3.00e-02 1.16e+00
is_diffuse_nidusTRUE 4.20e-01 0.52 -1.674 0.094 1.50e-01 1.16e+00
is_spetzler_martin_grade_less_than_4TRUE 2.07e+00 0.45 1.608 0.108 8.50e-01 5.04e+00
has_hemorrhageTRUE 1.82e+00 0.40 1.476 0.140 8.20e-01 4.01e+00
locationHemispheric 2.70e-01 0.91 -1.431 0.152 5.00e-02 1.62e+00
has_deficitTRUE 6.50e-01 0.40 -1.074 0.283 3.00e-01 1.43e+00
has_associated_aneurysmTRUE 7.10e-01 0.49 -0.692 0.489 2.80e-01 1.85e+00
(Intercept) 7.30e-01 1.02 -0.315 0.753 1.00e-01 5.31e+00

Plot the coefficients with their confidence intervals.

robust_ci %>%
  # broom does not support exponentiation after coeftest so do it manually
  broom::tidy(exponentiate = F, conf.int = T) %>%
  mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
  filter(term != "(Intercept)") %>%
  arrange(desc(estimate)) %>%
  mutate(term = factor(term, levels = term)) %>%
  ggplot(aes(x = estimate, y = term, color = term)) +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 1, linetype = 2, alpha = 0.5) +
  labs(x = "Odds ratio", y = NULL) +
  guides(color = F) +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 13)
  )

# Dotwhisker was not giving me the correct plot
# dotwhisker::dwplot(
#   vline = geom_vline(xintercept = 1, linetype = 2, alpha =  0.5),
#   show_intercept = F
# )

Had surgery

Select features.

# Define p-value threshold
pval_threshold <- 0.2

# Get the predictors of interest
predictors <-
  univariable_adjusted_recoded %>%
  bind_rows(univariable_unadjusted_recoded) %>%
  filter(`P-values` < pval_threshold) %>%
  pull(`Predictors`)

# Clean categorical variables
predictors <-
  predictors %>%
  str_replace("^([a-z0-9_]*)([A-Z].*)$", "\\1") %>%
  unique()

# Deduplicate (no additional adjustment variables are added here)
predictors <- unique(c(predictors))

# Remove unwanted predictors
predictors <- predictors[!predictors %in% unwanted_with_gradings]

# Replace procedure combinations by the binary surgery indicator
predictors <- c(predictors, "is_surgery")
predictors <- predictors[!predictors == "procedure_combinations"]

# Print
print(predictors)
[1] "spetzler_martin_grade"   "is_diffuse_nidus"       
[3] "has_hemorrhage"          "location"               
[5] "has_deficit"             "has_associated_aneurysm"
[7] "is_surgery"             

Fit logistic.

# Define data
df <- df_multi %>% drop_na(modified_rankin_score_pretreatment, location)

# Create formula
model <- as.formula(paste(
  CHOSEN_OUTCOME,
  "~",
  paste(predictors, collapse = " + ")
))

fit <- glm(
  model,
  data = df,
  family = binomial()
)

# Save
fit_without_grading <- fit

Print results.

# Summarize model coefficients
fit_summary <-
  fit %>%
  broom::tidy(exponentiate = T, conf.int = T) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
stylized <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high,
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3),
  )

# Print
stylized %>% sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
spetzler_martin_grade 5.50e-01 0.23 -2.648 0.008 0.35 8.40e-01
locationDeep 2.70e-01 0.85 -1.540 0.124 0.04 1.21e+00
has_hemorrhageTRUE 1.81e+00 0.39 1.510 0.131 0.84 3.94e+00
locationHemispheric 4.10e-01 0.83 -1.075 0.282 0.06 1.75e+00
has_deficitTRUE 6.60e-01 0.40 -1.049 0.294 0.30 1.45e+00
is_diffuse_nidusTRUE 6.50e-01 0.53 -0.818 0.413 0.23 1.85e+00
has_associated_aneurysmTRUE 7.10e-01 0.47 -0.741 0.459 0.28 1.80e+00
is_surgeryTRUE 1.14e+07 863.18 0.019 0.985 0.00 3.61e+130
(Intercept) 5.60e+01 1.13 3.548 0.000 7.54 6.95e+02

Create robust CIs using the sandwich estimator.

# Calculate robust standard errors
robust_se <- sandwich::vcovHC(fit, type = "HC0")

# Compute robust confidence intervals
robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)

Stylize results.

# Summarize model coefficients
fit_summary <-
  robust_ci %>%
  # broom does not support exponentiation after coeftest so do it manually
  broom::tidy(exponentiate = F, conf.int = T) %>%
  mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
multivariable_pvalue <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high,
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3),
  )

# Print
multivariable_pvalue %>% sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
is_surgeryTRUE 1.14e+07 0.38 43.191 0.000 5.47e+06 2.39e+07
spetzler_martin_grade 5.50e-01 0.21 -2.897 0.004 3.70e-01 8.20e-01
locationDeep 2.70e-01 0.78 -1.673 0.094 6.00e-02 1.25e+00
has_hemorrhageTRUE 1.81e+00 0.39 1.516 0.130 8.40e-01 3.87e+00
locationHemispheric 4.10e-01 0.79 -1.137 0.255 9.00e-02 1.91e+00
has_deficitTRUE 6.60e-01 0.40 -1.038 0.299 3.00e-01 1.45e+00
is_diffuse_nidusTRUE 6.50e-01 0.51 -0.846 0.398 2.40e-01 1.76e+00
has_associated_aneurysmTRUE 7.10e-01 0.48 -0.718 0.473 2.80e-01 1.82e+00
(Intercept) 5.60e+01 1.16 3.480 0.001 5.80e+00 5.40e+02

Plot the coefficients with their confidence intervals.

robust_ci %>%
  # broom does not support exponentiation after coeftest so do it manually
  broom::tidy(exponentiate = F, conf.int = T) %>%
  mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
  filter(term != "(Intercept)") %>%
  arrange(desc(estimate)) %>%
  mutate(term = factor(term, levels = term)) %>%
  ggplot(aes(x = estimate, y = term, color = term)) +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 1, linetype = 2, alpha = 0.5) +
  labs(x = "Odds ratio", y = NULL) +
  guides(color = F) +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 13)
  )

# Dotwhisker was not giving me the correct plot
# dotwhisker::dwplot(
#   vline = geom_vline(xintercept = 1, linetype = 2, alpha =  0.5),
#   show_intercept = F
# )

Grade 5s

df_uni %>%
  drop_na(spetzler_martin_grade) %>%
  count(is_poor_outcome, spetzler_martin_grade, name = "num_patients") %>%
  mutate(
    pct_patients = scales::percent(num_patients / sum(num_patients), 1)
  ) %>%
  sable()
is_poor_outcome spetzler_martin_grade num_patients pct_patients
FALSE 1 38 17%
FALSE 2 45 20%
FALSE 3 65 29%
FALSE 4 37 16%
FALSE 5 17 7%
TRUE 1 2 1%
TRUE 2 3 1%
TRUE 3 6 3%
TRUE 4 7 3%
TRUE 5 8 4%

Only consider those with SM grade 5.

df_uni %>%
  filter(spetzler_martin_grade == 5) %>%
  count(is_poor_outcome, spetzler_martin_grade, name = "num_patients") %>%
  mutate(
    pct_patients = scales::percent(num_patients / sum(num_patients), 1)
  ) %>%
  sable()
is_poor_outcome spetzler_martin_grade num_patients pct_patients
FALSE 5 17 68%
TRUE 5 8 32%
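
With only 25 grade-5 patients, the 32% poor-outcome proportion carries substantial uncertainty. An exact binomial interval makes this explicit (a sketch using base R's binom.test on the counts above):

# Exact (Clopper-Pearson) 95% CI for poor outcome among SM grade 5
binom.test(x = 8, n = 25)$conf.int

The interval is roughly 0.15 to 0.54, so the point estimate alone should be interpreted cautiously.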

Write

Setup

Create the necessary directories.

# Get today's date as a YYYY-MM-DD string
today <- format(Sys.Date(), "%Y-%m-%d")

# Create folder names
base_folder <- file.path(DST_DIRNAME, ANALYSIS_NAME)
date_folder <- file.path(base_folder, today)
csv_folder <- file.path(date_folder, "csv")
pdf_folder <- file.path(date_folder, "pdf")
html_folder <- file.path(date_folder, "html")

# Create folders (showWarnings = FALSE silences the warning when a folder
# already exists)
for (folder in c(base_folder, date_folder, csv_folder, pdf_folder, html_folder)) {
  dir.create(folder, showWarnings = FALSE)
}

# Print folder names
print(base_folder)
print(date_folder)
print(csv_folder)
print(pdf_folder)
print(html_folder)
[1] "../../outputs//predictive-analytics/is-obliterated"
[1] "../../outputs//predictive-analytics/is-obliterated/2024-07-29"
[1] "../../outputs//predictive-analytics/is-obliterated/2024-07-29/csv"
[1] "../../outputs//predictive-analytics/is-obliterated/2024-07-29/pdf"
[1] "../../outputs//predictive-analytics/is-obliterated/2024-07-29/html"

Figures

Write all figures.

# Save
# ggsave(
#   file.path(pdf_folder, "histogram_num_days.pdf"),
#   plot = plots$histogram_num_days,
#   width = 6,
#   height = 6
# )
# # Start graphic device
# pdf(
#   file = file.path(pdf_folder, "forest-plot_frequentist.pdf"),
#   width = 10,
#   height = 15
# )
#
# # Plot
# plots$forest_plot_frequentist
#
# # Shut down device
# dev.off()

Tables

Write all tables.

# # Arguments
# df <- tables$desc_stats_cohorts_cv_prolaio
# filepath_csv <- file.path(csv_folder, "desc-stats_by-cohort_cov.csv")
# filepath_html <- file.path(html_folder, "desc-stats_by-cohort_cov.html")
#
# # Save as CSV
# write_csv(
#   x = df,
#   file = filepath_csv
# )
#
# # Save as HTML
# # - Saving as PDF does not work with webshot
# # - pagedown works, but the result is not as pretty as desired
# df %>%
#   sable() %>%
#   kableExtra::save_kable(file = filepath_html)
write_csv(
  univariable_unadjusted,
  file.path(csv_folder, "univariable_unadjusted.csv")
)
write_csv(
  univariable_adjusted,
  file.path(csv_folder, "univariable_adjusted.csv")
)
write_csv(
  multivariable_pvalue,
  file.path(csv_folder, "multivariable_pvalue.csv")
)
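
If more tables accumulate, the repeated write_csv() calls above can be collapsed into a single loop (a sketch using purrr::iwalk, already loaded via the tidyverse; the file is named after each list element):

# Write each table to <csv_folder>/<name>.csv
list(
  univariable_unadjusted = univariable_unadjusted,
  univariable_adjusted = univariable_adjusted,
  multivariable_pvalue = multivariable_pvalue
) %>%
  purrr::iwalk(~ write_csv(.x, file.path(csv_folder, paste0(.y, ".csv"))))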

Data

Write all data.

# # Arguments
# df <- study
# filename <- paste0(ANALYSIS_NAME, "_", today, ".csv")
# filepath_csv <- file.path(DST_BUCKET, dst_dirname_data, filename)
#
# # Save as CSV
# write_csv(
#   x = df,
#   file = filepath_csv
# )

Reproducibility

Linting and styling

# Style current file
styler::style_file(
  path = rstudioapi::getSourceEditorContext()$path,
  style = styler::tidyverse_style
)

# Lint current file
lintr::lint(rstudioapi::getSourceEditorContext()$path)

Dependency management

# Clean up project of libraries not in use
# (use prompt = FALSE to avoid the interactive session)
# (packages can only be removed in interactive mode because this is destructive)
renv::clean(prompt = TRUE)

# Update lock file with new packages
renv::snapshot()

Containerization

# Only run this if option is set to TRUE
if (UPDATE_DOCKERFILE) {
  # Create a dockerfile from the session info
  my_dockerfile <- containerit::dockerfile(from = sessionInfo(), env = ls())
  # Write file
  write(my_dockerfile, file = "~/Dockerfile")
  # Print
  print(my_dockerfile)
}

Documentation

Session Info

R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.5

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;
LAPACK version 3.11.0

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base

other attached packages:
[1] grf_2.3.2 lubridate_1.9.3 forcats_1.0.0
[4] stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2
[7] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
[10] tidyverse_2.0.0 selectiveInference_1.2.5 MASS_7.3-60
[13] adaptMCMC_1.5 coda_0.19-4.1 survival_3.5-7
[16] intervals_0.15.4 glmnet_4.1-8 Matrix_1.6-1.1
[19] magrittr_2.0.3 ggplot2_3.5.0 kableExtra_1.4.0
[22] rmdformats_1.0.4 knitr_1.45

loaded via a namespace (and not attached):
[1] pROC_1.18.5 tcltk_4.3.2 sandwich_3.1-0
[4] rlang_1.1.3 e1071_1.7-14 matrixStats_1.3.0
[7] compiler_4.3.2 systemfonts_1.0.5 vctrs_0.6.5
[10] reshape2_1.4.4 cvAUC_1.1.4 pkgconfig_2.0.3
[13] shape_1.4.6.1 crayon_1.5.2 summarytools_1.0.1
[16] fastmap_1.1.1 backports_1.4.1 magick_2.8.3
[19] labeling_0.4.3 pander_0.6.5 utf8_1.2.4
[22] rmarkdown_2.25 prodlim_2023.08.28 tzdb_0.4.0
[25] bit_4.0.5 xfun_0.42 cachem_1.0.8
[28] jsonlite_1.8.8 recipes_1.0.10 highr_0.10
[31] pryr_0.1.6 broom_1.0.6 R6_2.5.1
[34] bslib_0.6.1 stringi_1.8.3 parallelly_1.37.1
[37] rpart_4.1.21 lmtest_0.9-40 jquerylib_0.1.4
[40] Rcpp_1.0.12 bookdown_0.39 assertthat_0.2.1
[43] iterators_1.0.14 future.apply_1.11.2 zoo_1.8-12
[46] pscl_1.5.9 base64enc_0.1-3 nnet_7.3-19
[49] splines_4.3.2 timechange_0.3.0 tidyselect_1.2.1
[52] rstudioapi_0.15.0 yaml_2.3.8 timeDate_4032.109
[55] codetools_0.2-19 listenv_0.9.1 lattice_0.21-9
[58] precrec_0.14.4 plyr_1.8.9 withr_3.0.0
[61] ROCR_1.0-11 evaluate_0.23 future_1.33.2
[64] proxy_0.4-27 xml2_1.3.6 BiocManager_1.30.22
[67] pillar_1.9.0 renv_1.0.4 stats4_4.3.2
[70] checkmate_2.3.1 foreach_1.5.2 generics_0.1.3
[73] vroom_1.6.5 hms_1.1.3 munsell_0.5.0
[76] scales_1.3.0 globals_0.16.3 class_7.3-22
[79] glue_1.7.0 tools_4.3.2 data.table_1.15.4
[82] ModelMetrics_1.2.2.2 gower_1.0.1 rapportools_1.1
[85] grid_4.3.2 ipred_0.9-14 colorspace_2.1-0
[88] nlme_3.1-163 patchwork_1.2.0 cli_3.6.2
[91] fansi_1.0.6 viridisLite_0.4.2 lava_1.8.0
[94] svglite_2.1.3 gtable_0.3.4 ggcorrplot_0.1.4.1
[97] sass_0.4.8 digest_0.6.34 caret_6.0-94
[100] farver_2.1.1 htmltools_0.5.7 lifecycle_1.0.4
[103] hardhat_1.4.0 bit64_4.0.5

References

[[1]]
Scheidegger A (2024). _adaptMCMC: Implementation of a Generic Adaptive
Monte Carlo Markov Chain Sampler_. R package version 1.5,
<https://CRAN.R-project.org/package=adaptMCMC>.

[[2]]
R Core Team (2023). _R: A Language and Environment for Statistical
Computing_. R Foundation for Statistical Computing, Vienna, Austria.
<https://www.R-project.org/>.

[[3]]
Plummer M, Best N, Cowles K, Vines K (2006). "CODA: Convergence
Diagnosis and Output Analysis for MCMC." _R News_, *6*(1), 7-11.
<https://journal.r-project.org/archive/>.

[[4]]
Wickham H, François R, Henry L, Müller K, Vaughan D (2023). _dplyr: A
Grammar of Data Manipulation_. R package version 1.1.4,
<https://CRAN.R-project.org/package=dplyr>.

[[5]]
Wickham H (2023). _forcats: Tools for Working with Categorical
Variables (Factors)_. R package version 1.0.0,
<https://CRAN.R-project.org/package=forcats>.

[[6]]
Wickham H (2016). _ggplot2: Elegant Graphics for Data Analysis_.
Springer-Verlag New York. ISBN 978-3-319-24277-4,
<https://ggplot2.tidyverse.org>.

[[7]]
Friedman J, Tibshirani R, Hastie T (2010). "Regularization Paths for
Generalized Linear Models via Coordinate Descent." _Journal of
Statistical Software_, *33*(1), 1-22. doi:10.18637/jss.v033.i01
<https://doi.org/10.18637/jss.v033.i01>.

Simon N, Friedman J, Tibshirani R, Hastie T (2011). "Regularization
Paths for Cox's Proportional Hazards Model via Coordinate Descent."
_Journal of Statistical Software_, *39*(5), 1-13.
doi:10.18637/jss.v039.i05 <https://doi.org/10.18637/jss.v039.i05>.

Tay JK, Narasimhan B, Hastie T (2023). "Elastic Net Regularization
Paths for All Generalized Linear Models." _Journal of Statistical
Software_, *106*(1), 1-31. doi:10.18637/jss.v106.i01
<https://doi.org/10.18637/jss.v106.i01>.

[[8]]
Tibshirani J, Athey S, Sverdrup E, Wager S (2024). _grf: Generalized
Random Forests_. R package version 2.3.2,
<https://CRAN.R-project.org/package=grf>.

[[9]]
Bourgon R (2023). _intervals: Tools for Working with Points and
Intervals_. R package version 0.15.4,
<https://CRAN.R-project.org/package=intervals>.

[[10]]
Zhu H (2024). _kableExtra: Construct Complex Table with 'kable' and
Pipe Syntax_. R package version 1.4.0,
<https://CRAN.R-project.org/package=kableExtra>.

[[11]]
Xie Y (2023). _knitr: A General-Purpose Package for Dynamic Report
Generation in R_. R package version 1.45, <https://yihui.org/knitr/>.

Xie Y (2015). _Dynamic Documents with R and knitr_, 2nd edition.
Chapman and Hall/CRC, Boca Raton, Florida. ISBN 978-1498716963,
<https://yihui.org/knitr/>.

Xie Y (2014). "knitr: A Comprehensive Tool for Reproducible Research in
R." In Stodden V, Leisch F, Peng RD (eds.), _Implementing Reproducible
Computational Research_. Chapman and Hall/CRC. ISBN 978-1466561595.

[[12]]
Grolemund G, Wickham H (2011). "Dates and Times Made Easy with
lubridate." _Journal of Statistical Software_, *40*(3), 1-25.
<https://www.jstatsoft.org/v40/i03/>.

[[13]]
Bache S, Wickham H (2022). _magrittr: A Forward-Pipe Operator for R_. R
package version 2.0.3, <https://CRAN.R-project.org/package=magrittr>.

[[14]]
Venables WN, Ripley BD (2002). _Modern Applied Statistics with S_,
Fourth edition. Springer, New York. ISBN 0-387-95457-0,
<https://www.stats.ox.ac.uk/pub/MASS4/>.

[[15]]
Bates D, Maechler M, Jagan M (2023). _Matrix: Sparse and Dense Matrix
Classes and Methods_. R package version 1.6-1.1,
<https://CRAN.R-project.org/package=Matrix>.

[[16]]
Wickham H, Henry L (2023). _purrr: Functional Programming Tools_. R
package version 1.0.2, <https://CRAN.R-project.org/package=purrr>.

[[17]]
Wickham H, Hester J, Bryan J (2024). _readr: Read Rectangular Text
Data_. R package version 2.1.5,
<https://CRAN.R-project.org/package=readr>.

[[18]]
Barnier J (2022). _rmdformats: HTML Output Formats and Templates for
'rmarkdown' Documents_. R package version 1.0.4,
<https://CRAN.R-project.org/package=rmdformats>.

[[19]]
Tibshirani R, Tibshirani R, Taylor J, Loftus J, Reid S, Markovic J
(2019). _selectiveInference: Tools for Post-Selection Inference_. R
package version 1.2.5,
<https://CRAN.R-project.org/package=selectiveInference>.

[[20]]
Wickham H (2023). _stringr: Simple, Consistent Wrappers for Common
String Operations_. R package version 1.5.1,
<https://CRAN.R-project.org/package=stringr>.

[[21]]
Therneau T (2023). _A Package for Survival Analysis in R_. R package
version 3.5-7, <https://CRAN.R-project.org/package=survival>.

Terry M. Therneau, Patricia M. Grambsch (2000). _Modeling Survival
Data: Extending the Cox Model_. Springer, New York. ISBN 0-387-98784-3.

[[22]]
Müller K, Wickham H (2023). _tibble: Simple Data Frames_. R package
version 3.2.1, <https://CRAN.R-project.org/package=tibble>.

[[23]]
Wickham H, Vaughan D, Girlich M (2024). _tidyr: Tidy Messy Data_. R
package version 1.3.1, <https://CRAN.R-project.org/package=tidyr>.

[[24]]
Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R,
Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E,
Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi
K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to the
tidyverse." _Journal of Open Source Software_, *4*(43), 1686.
doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.